e-Chapter 8 


Support Vector Machines 


Linear models are powerful. The nonlinear transform and the neural network 
(a cascade of linear models) are tools that increase their expressive power. 
Increasing the expressive power has a price: overfitting and computation time. 
Can we get the expressive power without paying the price? The answer is yes. 
Our new model, the Support Vector Machine (SVM), uses a ‘safety cushion’ 
when separating the data. As a result, SVM is more robust to noise; this 
helps combat overfitting. In addition, SVM can work seamlessly with a new 
powerful tool called the kernel: a computationally efficient way to use high 
(even infinite!) dimensional nonlinear transforms. When we combine the 
safety cushion of SVM with the ‘kernel trick’, the result is an efficient powerful 
nonlinear model with automatic regularization. The SVM model is popular 
because it performs well in practice and is easy to use. This chapter presents 
the mathematics and algorithms that make SVM work. 


8.1 The Optimal Hyperplane 
Let us revisit the perceptron model from Chapter 1. The illustrations below
on a toy data set should help jog your memory. In 2 dimensions, a perceptron
attempts to separate the data with a line, which is possible in this example.
As you can see, many lines separate the data and the Perceptron Learning
Algorithm (PLA) finds one of them. Do we care about which one PLA finds?
All separators have Ein = 0, so the VC analysis in Chapter 2 gives the same
Eout-bound for every separator. Well, the VC bound may say one thing, but
surely our intuition says that the rightmost separator is preferred.

Let’s try to pin down an argument that supports our intuition. In practice, 
there are measurement errors — noise. Place identical shaded regions around 
each data point, with the radius of the region being the amount of possible 
measurement error. The true data point can lie anywhere within this ‘region 
of uncertainty’ on account of the measurement error. A separator is ‘safe’ with 
respect to the measurement error if it classifies the true data points correctly. 
That is, no matter where in its region of uncertainty the true data point lies, it 
is still on the correct side of the separator. The figure below shows the largest 
measurement errors which are safe for each separator. 
































A separator that can tolerate more measurement error is safer. The right- 
most separator tolerates the largest error, whereas for the leftmost separator, 
even a small error in some data points could result in a misclassification. In 
Chapter 4, we saw that noise (for example measurement error) is the main 
cause of overfitting. Regularization helps us combat noise and avoid overfitting.
In our example, the rightmost separator is more robust to noise without
compromising Ein; it is better ‘regularized’. Our intuition is well justified.

We can also quantify noise tolerance from the viewpoint of the separator. 
Place a cushion on each side of the separator. We call such a separator with a 
cushion fat, and we say that it separates the data if no data point lies within 
its cushion. Here is the largest cushion we can place around each of our three 
candidate separators. 
To get the thickest cushion, keep extending the cushion equally on both sides
of the separator until you hit a data point. The thickness reflects the amount 
of noise the separator can tolerate. If any data point is perturbed by at 
most the thickness of either side of the cushion, it will still be on the correct 
side of the separator. The maximum thickness (noise tolerance) possible for 
a separator is called its margin. Our intuition tells us to pick the fattest 
separator possible, the one with maximum margin. In this section, we will 
address three important questions: 


1. Can we efficiently find the fattest separator? 

2. Why is a fat separator better than a thin one? 

3. What should we do if the data is not separable? 
The first question relates to the algorithm; the second relates to Eout; and, 
the third relates to Ein (we will also elaborate on this question in Section 8.4). 


Our discussion was in 2 dimensions. In higher dimensions, the separator 
is a hyperplane and our intuition still holds. Here is a warm-up. 


Exercise 8.1 
Assume $\mathcal{D}$ contains two data points $(\mathbf{x}_+, +1)$ and $(\mathbf{x}_-, -1)$. Show that:
(a) No hyperplane can tolerate noise radius greater than $\frac{1}{2}\|\mathbf{x}_+ - \mathbf{x}_-\|$.

(b) There is a hyperplane that tolerates a noise radius $\frac{1}{2}\|\mathbf{x}_+ - \mathbf{x}_-\|$.


8.1.1 Finding the Fattest Separating Hyperplane 


Before we proceed to the mathematics, we fix a convention that we will use 
throughout the chapter. Recall that a hyperplane is defined by its weights w. 


Isolating the bias. We explicitly split the weight vector w as follows. The
bias weight $w_0$ is taken out, and the rest of the weights $w_1, \ldots, w_d$ remain in
w. The reason is that the mathematics will treat these two types of weights
differently. To avoid confusion, we will relabel $w_0$ to $b$ (for bias), but continue
to use w for $(w_1, \ldots, w_d)$.





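$$
\begin{array}{c|c}
\text{Previous Chapters} & \text{This Chapter} \\ \hline
\mathbf{x} \in \{1\} \times \mathbb{R}^d;\ \mathbf{w} \in \mathbb{R}^{d+1} & \mathbf{x} \in \mathbb{R}^d;\ b \in \mathbb{R},\ \mathbf{w} \in \mathbb{R}^d \\
\mathbf{x} = [1, x_1, \ldots, x_d]^T;\ \mathbf{w} = [w_0, w_1, \ldots, w_d]^T & \mathbf{x} = [x_1, \ldots, x_d]^T;\ \mathbf{w} = [w_1, \ldots, w_d]^T;\ b = \text{bias} \\
h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x}) & h(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T\mathbf{x} + b)
\end{array}
$$
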
A maximum-margin separating hyperplane has two defining properties.


1. It separates the data. 
2. It has the thickest cushion among hyperplanes that separate the data. 


To find a separating hyperplane with maximum margin, we first re-examine 
the definition of a separating hyperplane and reshape the definition into an 
equivalent, more convenient one. Then, we discuss how to compute the margin 
of any given separating hyperplane (so that we can find the one with maximum 
margin). As we observed in our earlier intuitive discussion, the margin is 
obtained by extending the cushion until you hit a data point. That is, the 
margin is the distance from the hyperplane to the nearest data point. We 
thus need to become familiar with the geometry of hyperplanes; in particular, 
how to compute the distance from a data point to the hyperplane. 


Separating hyperplanes. The hyperplane h, defined by $(b, \mathbf{w})$, separates the
data if and only if for $n = 1, \ldots, N$,

$$y_n(\mathbf{w}^T\mathbf{x}_n + b) > 0. \tag{8.1}$$

[Figure: a separating hyperplane, with $\mathbf{w}^T\mathbf{x}_n + b > 0$ on one side and $\mathbf{w}^T\mathbf{x}_n + b < 0$ on the other.]

The signal $y_n(\mathbf{w}^T\mathbf{x}_n + b)$ is positive for each data point. However, the
magnitude of the signal is not meaningful by itself since we can make it
arbitrarily small or large for the same hyperplane by rescaling the weights
and the bias. This is because $(b, \mathbf{w})$ is the same hyperplane as
$(b/\rho, \mathbf{w}/\rho)$ for any $\rho > 0$. By rescaling the weights, we can control
the size of the signal for our data points. Let us pick a particular value of $\rho$,

$$\rho = \min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b),$$

which is positive because of (8.1). Now, rescale the weights to obtain the same
hyperplane $(b/\rho, \mathbf{w}/\rho)$. For these rescaled weights,

$$\min_{n=1,\ldots,N} y_n\left(\frac{\mathbf{w}^T}{\rho}\mathbf{x}_n + \frac{b}{\rho}\right) = \frac{1}{\rho}\min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b) = \frac{\rho}{\rho} = 1.$$

Thus, for any separating hyperplane, it is always possible to choose weights
so that all the signals $y_n(\mathbf{w}^T\mathbf{x}_n + b)$ are of magnitude greater than or equal
to 1, with equality satisfied by at least one $(\mathbf{x}_n, y_n)$. This motivates our new
definition of a separating hyperplane.


Definition 8.1 (Separating Hyperplane). The hyperplane h separates the
data if and only if it can be represented by weights $(b, \mathbf{w})$ that satisfy

$$\min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b) = 1. \tag{8.2}$$

The conditions (8.1) and (8.2) are equivalent. Every separating hyperplane
can be accommodated under Definition 8.1. All we did is constrain the way 
we algebraically represent such a hyperplane by choosing a (data dependent) 
normalization for the weights, to ensure that the magnitude of the signal is 
meaningful. Our normalization in (8.2) will be particularly convenient for de- 
riving the algorithm to find the maximum-margin separator. The next exercise 
gives a concrete example of re-normalizing the weights to satisfy (8.2). 
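
The rescaling argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set, the labels, and the starting hyperplane `(b0, w0)` are hypothetical values chosen here, and the helper names `signals` and `canonical` are not from the text.

```python
def signals(b, w, X, y):
    """Return the signals y_n * (w^T x_n + b) for every data point."""
    return [yn * (sum(wi * xi for wi, xi in zip(w, xn)) + b)
            for xn, yn in zip(X, y)]


def canonical(b, w, X, y):
    """Rescale a separating hyperplane (b, w) so that the smallest signal is 1."""
    rho = min(signals(b, w, X, y))
    if rho <= 0:
        raise ValueError("(b, w) does not separate the data")
    return b / rho, [wi / rho for wi in w]


# Hypothetical data and starting hyperplane, chosen only for illustration.
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
b0, w0 = -2.0, [2.0, -4.0]

b1, w1 = canonical(b0, w0, X, y)
print(signals(b0, w0, X, y))        # signals before rescaling
print(b1, w1)                       # rescaled weights
print(min(signals(b1, w1, X, y)))   # smallest signal after rescaling
```

Both weight sets describe the same separator; only the rescaled pair satisfies (8.2).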


Exercise 8.2 


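Consider the data below and a ‘hyperplane’ $(b, \mathbf{w})$ that separates the data.

$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} 1.2 \\ -3.2 \end{bmatrix}, \quad b = -0.5$$

(a) Compute $\rho = \min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b)$.

(b) Compute the weights $\frac{1}{\rho}(b, \mathbf{w})$ and show that they satisfy (8.2).
(c) Plot both hyperplanes to show that they are the same separator.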


Margin of a hyperplane. To compute the margin of a separating hyperplane, we
need to compute the distance from the hyperplane to the nearest data point. As a
start, let us compute the distance from an arbitrary point $\mathbf{x}$ to a separating
hyperplane $h = (b, \mathbf{w})$ that satisfies (8.2). Denote this distance by
$\mathrm{dist}(\mathbf{x}, h)$. Referring to the figure on the right, $\mathrm{dist}(\mathbf{x}, h)$ is the
length of the perpendicular from $\mathbf{x}$ to h. Let $\mathbf{x}'$ be any point on the hyperplane,
which means $\mathbf{w}^T\mathbf{x}' + b = 0$. Let $\mathbf{u}$ be a unit vector that is normal to the
hyperplane h. Then, $\mathrm{dist}(\mathbf{x}, h) = |\mathbf{u}^T(\mathbf{x} - \mathbf{x}')|$, the projection of the
vector $(\mathbf{x} - \mathbf{x}')$ onto $\mathbf{u}$. We now argue that $\mathbf{w}$ is normal to the hyperplane,
and so we can take $\mathbf{u} = \mathbf{w}/\|\mathbf{w}\|$. Indeed, any vector lying on the hyperplane
can be expressed by $(\mathbf{x}'' - \mathbf{x}')$ for some $\mathbf{x}', \mathbf{x}''$ on the hyperplane, as shown.
Then, using $\mathbf{w}^T\mathbf{x} = -b$ for points on the hyperplane,








$$\mathbf{w}^T(\mathbf{x}'' - \mathbf{x}') = \mathbf{w}^T\mathbf{x}'' - \mathbf{w}^T\mathbf{x}' = -b + b = 0.$$


Therefore, w is orthogonal to every vector in the hyperplane, hence it is the 
normal vector as claimed. Setting u = w/||w||, the distance from x to h is 


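$$\mathrm{dist}(\mathbf{x}, h) = |\mathbf{u}^T(\mathbf{x} - \mathbf{x}')| = \frac{|\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\mathbf{x}'|}{\|\mathbf{w}\|} = \frac{|\mathbf{w}^T\mathbf{x} + b|}{\|\mathbf{w}\|},$$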





© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:8—5 













where we used $\mathbf{w}^T\mathbf{x}' = -b$ in the last step, since $\mathbf{x}'$ is on h. You can now
see why we separated the bias b from the weights w: the distance calculation
treats these two parameters differently.

We are now ready to compute the margin of a separating hyperplane.
Consider the data points $\mathbf{x}_1, \ldots, \mathbf{x}_N$, and hyperplane $h = (b, \mathbf{w})$ that satisfies
the separating condition (8.2). Since $y_n = \pm 1$,





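$$|\mathbf{w}^T\mathbf{x}_n + b| = |y_n(\mathbf{w}^T\mathbf{x}_n + b)| = y_n(\mathbf{w}^T\mathbf{x}_n + b),$$

where the last equality follows because $(b, \mathbf{w})$ separates the data, which implies
that $y_n(\mathbf{w}^T\mathbf{x}_n + b)$ is positive. So, the distance of a data point to h is

$$\mathrm{dist}(\mathbf{x}_n, h) = \frac{y_n(\mathbf{w}^T\mathbf{x}_n + b)}{\|\mathbf{w}\|}.$$

Therefore, the data point that is nearest to the hyperplane has distance

$$\min_{n=1,\ldots,N} \mathrm{dist}(\mathbf{x}_n, h) = \frac{1}{\|\mathbf{w}\|}\min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b) = \frac{1}{\|\mathbf{w}\|},$$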


where the last equality follows because $(b, \mathbf{w})$ separates the data and satisfies
(8.2). This simple expression for the distance of the nearest data point to
the hyperplane is the entire reason why we chose to normalize $(b, \mathbf{w})$ as we did,
by requiring (8.2).¹ For any separating hyperplane satisfying (8.2), the margin
is $1/\|\mathbf{w}\|$. If you hold on a little longer, you are about to reap the full benefit,
namely a simple algorithm for finding the optimal (fattest) hyperplane.
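
As a quick numerical check of the distance formula and of the margin $1/\|\mathbf{w}\|$, here is a minimal Python sketch. The data set and the hyperplane `(b, w)` are hypothetical values picked so that (8.2) holds with labels (-1, -1, +1, +1); the helper names `dist` and `margin` are introduced only for this illustration.

```python
import math


def dist(x, b, w):
    """Distance from the point x to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w


def margin(b, w, X):
    """Margin of a separating hyperplane: distance to the nearest data point."""
    return min(dist(x, b, w) for x in X)


# Hypothetical data; with labels (-1, -1, +1, +1) the signals are 1, 3, 1, 2,
# so this (b, w) satisfies the normalization (8.2).
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
b, w = -1.0, [1.0, -2.0]

inv_norm = 1 / math.sqrt(sum(wi * wi for wi in w))
print(margin(b, w, X), inv_norm)   # the two values agree up to rounding
```

For a hyperplane that is not normalized as in (8.2), `margin` still returns the correct distance, but it no longer equals $1/\|\mathbf{w}\|$.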


The fattest separating hyperplane. The maximum-margin separating
hyperplane $(b^*, \mathbf{w}^*)$ is the one that satisfies the separation condition (8.2)
with minimum weight-norm (since the margin is the inverse of the weight-norm).
Instead of minimizing the weight-norm, we can equivalently minimize
$\frac{1}{2}\mathbf{w}^T\mathbf{w}$, which is analytically more friendly. Therefore, to find this optimal
hyperplane, we need to solve the following optimization problem.


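$$\begin{aligned}
\underset{b,\mathbf{w}}{\text{minimize:}} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \\
\text{subject to:} \quad & \min_{n=1,\ldots,N} y_n(\mathbf{w}^T\mathbf{x}_n + b) = 1.
\end{aligned} \tag{8.3}$$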


The constraint ensures that the hyperplane separates the data as per (8.2). 
Observe that the bias b does not appear in the quantity being minimized, but 
it is involved in the constraint (again, b is treated differently than w). To make 
the optimization problem easier to solve, we can replace the single constraint
$\min_n y_n(\mathbf{w}^T\mathbf{x}_n + b) = 1$ with N ‘looser’ constraints $y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1$ for
$n = 1, \ldots, N$ and solve the optimization problem:

$$\begin{aligned}
\underset{b,\mathbf{w}}{\text{minimize:}} \quad & \tfrac{1}{2}\mathbf{w}^T\mathbf{w} \\
\text{subject to:} \quad & y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1 \quad (n = 1, \ldots, N).
\end{aligned} \tag{8.4}$$

¹ Some terminology: the parameters $(b, \mathbf{w})$ of a separating hyperplane that satisfy (8.2)
are called the canonical representation of the hyperplane. For a separating hyperplane in
its canonical representation, the margin is just the inverse norm of the weights.


The constraint in (8.3) implies the constraints in (8.4), which means that 
the constraints in (8.4) are looser. Fortunately, at the optimal solution, the 
constraints in (8.4) become equivalent to the constraint in (8.3) as long as 
there are both positive and negative examples in the data. After solving (8.4), 
we will show that the constraint of (8.3) is automatically satisfied. This means 
that we will also have solved (8.3). 

To do that, we will use a proof by contradiction. Suppose that the solution
$(b^*, \mathbf{w}^*)$ of (8.4) has

$$\rho^* = \min_n y_n(\mathbf{w}^{*T}\mathbf{x}_n + b^*) > 1,$$

and therefore is not a solution to (8.3). Consider the rescaled hyperplane
$(b, \mathbf{w}) = \frac{1}{\rho^*}(b^*, \mathbf{w}^*)$, which satisfies the constraints in (8.4) by construction.
For $(b, \mathbf{w})$, we have that $\|\mathbf{w}\| = \frac{1}{\rho^*}\|\mathbf{w}^*\| < \|\mathbf{w}^*\|$ (unless $\mathbf{w}^* = \mathbf{0}$), which
means that $\mathbf{w}^*$ cannot be optimal for (8.4) unless $\mathbf{w}^* = \mathbf{0}$. It is not possible
to have $\mathbf{w}^* = \mathbf{0}$ since this would not correctly classify both the positive and
negative examples in the data.

We will refer to this fattest separating hyperplane as the optimal hyper- 
plane. To get the optimal hyperplane, all we have to do is solve the optimiza- 
tion problem in (8.4). 


Example 8.2. The best way to get a handle on what is going on is to carefully 
work through an example to see how solving the optimization problem in (8.4) 
results in the optimal hyperplane (b*,w*). In two dimensions, a hyperplane 
is specified by the parameters (b, w1,w2). Let us consider the toy data set 
that was the basis for the figures on page 8-2. The data matrix and target 
values, together with the separability constraints from (8.4) are summarized 
below. The inequality on a particular row is the separability constraint for the 
corresponding data point in that row. 








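$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}, \quad
\mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}, \quad
\begin{array}{rl}
-b \ge 1 & \text{(i)} \\
-(2w_1 + 2w_2 + b) \ge 1 & \text{(ii)} \\
2w_1 + b \ge 1 & \text{(iii)} \\
3w_1 + b \ge 1 & \text{(iv)}
\end{array}$$

Combining (i) and (iii) gives

$$w_1 \ge 1.$$

Combining (ii) and (iii) gives

$$w_2 \le -1.$$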
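
For readers who want the intermediate step, "combining" here means adding the two inequalities so that the bias cancels. The short derivation below is a spelled-out version of that step, using the four constraints (i) to (iv) listed above.

```latex
\begin{align*}
\text{(i)} + \text{(iii)}: &\quad (-b) + (2w_1 + b) \ge 1 + 1
  \;\Longrightarrow\; 2w_1 \ge 2
  \;\Longrightarrow\; w_1 \ge 1, \\
\text{(ii)} + \text{(iii)}: &\quad -(2w_1 + 2w_2 + b) + (2w_1 + b) \ge 1 + 1
  \;\Longrightarrow\; -2w_2 \ge 2
  \;\Longrightarrow\; w_2 \le -1.
\end{align*}
```
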
This means that $\frac{1}{2}(w_1^2 + w_2^2) \ge 1$ with equality when $w_1 = 1$ and $w_2 = -1$.
One can easily verify that

$$(b^* = -1,\ w_1^* = 1,\ w_2^* = -1)$$

satisfies all four constraints, minimizes $\frac{1}{2}(w_1^2 + w_2^2)$, and therefore gives the
optimal hyperplane. The optimal hyperplane is shown in the following figure.





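[Figure: Optimal Hyperplane, $g(\mathbf{x}) = \operatorname{sign}(x_1 - x_2 - 1)$; margin: $\frac{1}{\|\mathbf{w}^*\|} = \frac{1}{\sqrt{2}} \approx 0.707$.]

Data points (i), (ii) and (iii) are boxed because their separation constraints
are exactly met: $y_n(\mathbf{w}^{*T}\mathbf{x}_n + b^*) = 1$.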








For data points which meet their constraints exactly, $\mathrm{dist}(\mathbf{x}_n, g) = \frac{1}{\|\mathbf{w}^*\|}$. These
data points sit on the boundary of the cushion and play an important role.
They are called support vectors. In a sense, the support vectors are ‘supporting’
the cushion and preventing it from expanding further.
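
The support vectors of Example 8.2 can be picked out numerically by checking which constraints hold with equality. The sketch below assumes the solution $(b^*, w_1^*, w_2^*) = (-1, 1, -1)$ from the example; the tolerance `tol` and the helper name `signal` are choices made here for illustration.

```python
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
b_star, w_star = -1.0, (1.0, -1.0)   # solution found in Example 8.2


def signal(x, yn, b, w):
    """The signal y_n * (w^T x_n + b) of a single data point."""
    return yn * (sum(wi * xi for wi, xi in zip(w, x)) + b)


sig = [signal(x, yn, b_star, w_star) for x, yn in zip(X, y)]
feasible = all(s >= 1 for s in sig)

# Support vectors meet their constraint with equality (up to a tolerance).
tol = 1e-9
support = [n for n, s in enumerate(sig) if abs(s - 1) <= tol]

print(sig)       # [1.0, 1.0, 1.0, 2.0]
print(support)   # [0, 1, 2], that is, data points (i), (ii) and (iii)
```

Data point (iv) has signal 2, so it lies strictly outside the cushion and is not a support vector.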














Exercise 8.3 


For separable data that contain both positive and negative examples, and
a separating hyperplane h, define the positive-side margin $\rho_+(h)$ to be
the distance between h and the nearest data point of class +1. Similarly,
define the negative-side margin $\rho_-(h)$ to be the distance between h and the
nearest data point of class −1. Argue that if h is the optimal hyperplane,
then $\rho_+(h) = \rho_-(h)$. That is, the thickness of the cushion on either side
of the optimal h is equal.


We make an important observation that will be useful later. In Example 8.2, 
what happens to the optimal hyperplane if we removed data point (iv), the 
non-support vector? Nothing! The hyperplane remains a separator with the 
same margin. Even though we removed a data point, a larger margin cannot 
be achieved since all the support vectors that previously prevented the margin 
from expanding are still in the data. So the hyperplane remains optimal. In- 
deed, to compute the optimal hyperplane, only the support vectors are needed; 
the other data could be thrown away. 


Quadratic Programming (QP). For bigger data sets, manually solving 
the optimization problem in (8.4) as we did in Example 8.2 is no longer feasible. 
The good news is that (8.4) belongs to a well-studied family of optimization
problems known as quadratic programming (QP). Whenever you minimize a 
(convex) quadratic function, subject to linear inequality constraints, you can 
use quadratic programming. Quadratic programming is such a well studied 
area that excellent, publicly available solvers exist for many numerical comput- 
ing platforms. We will not need to know how to solve a quadratic programming 
problem; we will only need to know how to take any given problem and con- 
vert it into a standard form which is then input to a QP-solver. We begin by 
describing the standard form of a QP-problem: 


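$$\begin{aligned}
\underset{\mathbf{u} \in \mathbb{R}^L}{\text{minimize:}} \quad & \tfrac{1}{2}\mathbf{u}^TQ\mathbf{u} + \mathbf{p}^T\mathbf{u} \\
\text{subject to:} \quad & \mathbf{a}_m^T\mathbf{u} \ge c_m \quad (m = 1, \ldots, M).
\end{aligned} \tag{8.5}$$

The variables to be optimized are the components of the vector $\mathbf{u}$ which has
dimension L. All the other parameters $Q$, $\mathbf{p}$, $\mathbf{a}_m$ and $c_m$ for $m = 1, \ldots, M$
are user-specified. The quantity being minimized contains a quadratic term
$\frac{1}{2}\mathbf{u}^TQ\mathbf{u}$ and a linear term $\mathbf{p}^T\mathbf{u}$. The coefficients of the quadratic terms are
in the $L \times L$ matrix $Q$: $\frac{1}{2}\mathbf{u}^TQ\mathbf{u} = \frac{1}{2}\sum_{i=1}^{L}\sum_{j=1}^{L} q_{ij}u_iu_j$. The coefficients of
the linear terms are in the $L \times 1$ vector $\mathbf{p}$: $\mathbf{p}^T\mathbf{u} = \sum_{i=1}^{L} p_iu_i$. There are M
linear inequality constraints, each one being specified by an $L \times 1$ vector $\mathbf{a}_m$
and scalar $c_m$. For the QP-problem to be convex, the matrix $Q$ needs to be
positive semi-definite. It is more convenient to specify the $\mathbf{a}_m$ as rows of an
$M \times L$ matrix $A$ and the $c_m$ as components of an $M \times 1$ vector $\mathbf{c}$: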


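$$A = \begin{bmatrix} \mathbf{a}_1^T \\ \vdots \\ \mathbf{a}_M^T \end{bmatrix}, \qquad
\mathbf{c} = \begin{bmatrix} c_1 \\ \vdots \\ c_M \end{bmatrix}.$$

Using the matrix representation, the QP-problem in (8.5) is written as

$$\begin{aligned}
\underset{\mathbf{u} \in \mathbb{R}^L}{\text{minimize:}} \quad & \tfrac{1}{2}\mathbf{u}^TQ\mathbf{u} + \mathbf{p}^T\mathbf{u} \\
\text{subject to:} \quad & A\mathbf{u} \ge \mathbf{c}.
\end{aligned} \tag{8.6}$$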


The M inequality constraints are all contained in the single vector constraint
$A\mathbf{u} \ge \mathbf{c}$ (which must hold for each component). We will write

$$\mathbf{u}^* \leftarrow \mathrm{QP}(Q, \mathbf{p}, A, \mathbf{c})$$

to denote the process of running a QP-solver on the input $(Q, \mathbf{p}, A, \mathbf{c})$ to get
an optimal solution $\mathbf{u}^*$ for (8.6).²


² A quick comment about the input to QP-solvers. Some QP-solvers take the reverse
inequality constraints $A\mathbf{u} \le \mathbf{c}$, which simply means you need to negate $A$ and $\mathbf{c}$ in (8.6).
An equality constraint is two inequality constraints ($a = b \iff a \ge b$ and $a \le b$). Some
solvers will accept separate upper and lower bound constraints on each component of $\mathbf{u}$.
All these different variants of quadratic programming can be represented by the standard
problem with linear inequality constraints in (8.6). However, special types of constraints
like equality and upper/lower bound constraints can often be handled in a more numerically
stable way than the general linear inequality constraint.
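
The two conversions mentioned in the footnote are mechanical. A minimal sketch, assuming matrices are stored as lists of rows; the function names `to_leq_form` and `to_inequalities` are hypothetical and do not belong to any particular solver.

```python
def to_leq_form(A, c):
    """Rewrite A u >= c as G u <= h by negating both sides."""
    G = [[-a for a in row] for row in A]
    h = [-ci for ci in c]
    return G, h


def to_inequalities(E, f):
    """Rewrite the equalities E u = f as the pair E u >= f and -E u >= -f."""
    A = [list(row) for row in E] + [[-e for e in row] for row in E]
    c = list(f) + [-fi for fi in f]
    return A, c


print(to_leq_form([[1, 2], [3, 4]], [5, 6]))
print(to_inequalities([[1, 1]], [2]))
```

Which form a given solver expects has to be checked in its documentation; the sketch only shows that the forms are interchangeable.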
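We now show that our optimization problem in (8.4) is indeed a QP-problem.
To do so, we must identify $Q$, $\mathbf{p}$, $A$ and $\mathbf{c}$. First, observe that the
quantities that need to be solved for (the optimization variables) are $(b, \mathbf{w})$;
collecting these into $\mathbf{u}$, we identify the optimization variable
$\mathbf{u} = \begin{bmatrix} b \\ \mathbf{w} \end{bmatrix} \in \mathbb{R}^{d+1}$.
The dimension of the optimization problem is $L = d + 1$. The quantity we
are minimizing is $\frac{1}{2}\mathbf{w}^T\mathbf{w}$. We need to write this in the form $\frac{1}{2}\mathbf{u}^TQ\mathbf{u} + \mathbf{p}^T\mathbf{u}$.
Observe that

$$\mathbf{w}^T\mathbf{w} = \begin{bmatrix} b & \mathbf{w}^T \end{bmatrix}
\begin{bmatrix} 0 & \mathbf{0}_d^T \\ \mathbf{0}_d & I_d \end{bmatrix}
\begin{bmatrix} b \\ \mathbf{w} \end{bmatrix}
= \mathbf{u}^T \begin{bmatrix} 0 & \mathbf{0}_d^T \\ \mathbf{0}_d & I_d \end{bmatrix} \mathbf{u},$$

where $I_d$ is the $d \times d$ identity matrix and $\mathbf{0}_d$ the d-dimensional zero vector. We
can identify the quadratic term with
$Q = \begin{bmatrix} 0 & \mathbf{0}_d^T \\ \mathbf{0}_d & I_d \end{bmatrix}$, and there is no linear term,
so $\mathbf{p} = \mathbf{0}_{d+1}$. As for the constraints, there are N of them in (8.4), so
$M = N$. The n-th constraint is $y_n(\mathbf{w}^T\mathbf{x}_n + b) \ge 1$, which is equivalent to

$$\begin{bmatrix} y_n & y_n\mathbf{x}_n^T \end{bmatrix} \mathbf{u} \ge 1,$$

and that corresponds to setting $\mathbf{a}_n^T = y_n\begin{bmatrix} 1 & \mathbf{x}_n^T \end{bmatrix}$ and $c_n = 1$ in (8.5). So, the
matrix $A$ contains rows which are closely related to an old friend from Chapter 3,
the data matrix X augmented with a column of 1s. In fact, the n-th row of
$A$ is just the n-th row of the data matrix but multiplied by its label $y_n$. The
constraint vector is $\mathbf{c} = \mathbf{1}_N$, an N-dimensional vector of ones.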
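
The identification of $Q$, $\mathbf{p}$, $A$ and $\mathbf{c}$ translates directly into code. The following is a minimal sketch using plain Python lists, applied to the toy data of Example 8.2; the function name `svm_qp_inputs` is an assumption made for this illustration, and no QP-solver is called.

```python
def svm_qp_inputs(X, y):
    """Build (Q, p, A, c) of (8.6) for the hard-margin problem (8.4).

    The optimization variable is u = [b, w_1, ..., w_d].
    """
    d = len(X[0])
    # Q = [[0, 0_d^T], [0_d, I_d]]: the bias b is not penalized.
    Q = [[1 if (i == j and i > 0) else 0 for j in range(d + 1)]
         for i in range(d + 1)]
    p = [0] * (d + 1)
    # The n-th row of A is y_n * [1, x_n^T]; the n-th entry of c is 1.
    A = [[yn] + [yn * xi for xi in xn] for xn, yn in zip(X, y)]
    c = [1] * len(X)
    return Q, p, A, c


X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
Q, p, A, c = svm_qp_inputs(X, y)
for row in A:
    print(row)
```

The four lists can then be handed to whichever QP-solver is available, after converting them to that solver's matrix type and inequality convention.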


Exercise 8.4 


Let Y be an $N \times N$ diagonal matrix with diagonal entries $Y_{nn} = y_n$ (a
matrix version of the target vector y). Let X be the data matrix augmented
with a column of 1s. Show that

$$A = YX.$$


We summarize below the algorithm to get an optimal hyperplane, which re- 
duces to a QP-problem in d+ 1 variables with N constraints. 


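Linear Hard-Margin SVM with QP

1: Let $\mathbf{p} = \mathbf{0}_{d+1}$ ((d + 1)-dimensional zero vector) and $\mathbf{c} = \mathbf{1}_N$
   (N-dimensional vector of ones). Construct matrices $Q$ and $A$, where

$$Q = \begin{bmatrix} 0 & \mathbf{0}_d^T \\ \mathbf{0}_d & I_d \end{bmatrix}, \qquad
A = \underbrace{\begin{bmatrix} y_1 & y_1\mathbf{x}_1^T \\ \vdots & \vdots \\ y_N & y_N\mathbf{x}_N^T \end{bmatrix}}_{\text{signed data matrix}}.$$

2: Calculate $\begin{bmatrix} b^* \\ \mathbf{w}^* \end{bmatrix} = \mathbf{u}^* \leftarrow \mathrm{QP}(Q, \mathbf{p}, A, \mathbf{c})$.

3: Return the hypothesis $g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^{*T}\mathbf{x} + b^*)$.
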
The learning model we have been discussing is known as the linear hard-
margin support vector machine (linear hard-margin SVM). Yes, it sounds like
something out of a science fiction book. Don’t worry, it is just a name that
describes an algorithm for constructing an optimal linear model, and in this
algorithm only a few of the data points have a role to play in determining the
final hypothesis; these few data points are called support vectors, as mentioned
earlier. The margin being hard means that no data is allowed to lie inside the
cushion. We can relax this condition, and we will do so later in this chapter.


Exercise 8.5 
Show that the matrix $Q$ described in the linear hard-margin SVM algorithm
above is positive semi-definite (that is $\mathbf{u}^TQ\mathbf{u} \ge 0$ for any $\mathbf{u}$).


The result means that the QP-problem is convex. Convexity is useful because
this makes it ‘easy’ to find an optimal solution. In fact, standard
QP-solvers can solve our convex QP-problem in $O((N + d)^3)$.


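Example 8.3. We can explicitly construct the QP-problem for our toy example
in Example 8.2. We construct $Q$, $\mathbf{p}$, $A$, and $\mathbf{c}$ as follows.

$$Q = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\mathbf{p} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad
A = \begin{bmatrix} -1 & 0 & 0 \\ -1 & -2 & -2 \\ 1 & 2 & 0 \\ 1 & 3 & 0 \end{bmatrix}, \quad
\mathbf{c} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

A standard QP-solver gives $(b^*, w_1^*, w_2^*) = (-1, 1, -1)$, the same solution we
computed manually, but obtained in less than a millisecond.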














In addition to standard QP-solvers that are publicly available, there are also 
specifically tailored solvers for the linear SVM, which are often more efficient 
for learning from large-scale data, characterized by a large N or d. Some 
packages are based on a version of stochastic gradient descent (SGD), a tech- 
nique for optimization that we introduced in Chapter 3, and other packages 
use more sophisticated optimization techniques that take advantage of special 
properties of SVM (such as the redundancy of non-support vectors). 


Example 8.4 (Comparison of SVM with PLA). We construct a toy data 
set and use it to compare SVM with the perceptron learning algorithm. To 
generate the data, we randomly generate $x_1 \in [0, 1]$ and $x_2 \in [-1, 1]$ with
the target function being +1 above the $x_1$-axis; $f(\mathbf{x}) = \operatorname{sign}(x_2)$. In this
experiment, we used the version of PLA that updates the weights using the
misclassified point $\mathbf{x}_n$ with lowest index n. The histogram in Figure 8.1 shows
what will typically happen with PLA. Depending on the ordering of the data, 
PLA can sometimes get lucky and beat the SVM, and sometimes it can be 
much worse. The SVM (maximum-margin) classifier does not depend on the 
(random) ordering of the data. 
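
The dependence of PLA on the ordering can be seen even on a tiny data set. The sketch below is not the experiment of Example 8.4: to stay deterministic it reuses the four points of Example 8.2 instead of random data, and it compares margins rather than Eout. The function names `pla` and `margin` and the cap `max_updates` are assumptions made here.

```python
import math


def pla(X, y, order, max_updates=1000):
    """PLA that always updates on the first misclassified point in `order`."""
    b, w = 0.0, [0.0] * len(X[0])
    for _ in range(max_updates):
        for n in order:
            s = sum(wi * xi for wi, xi in zip(w, X[n])) + b
            if y[n] * s <= 0:            # misclassified (or on the boundary)
                b += y[n]
                w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]
                break
        else:                            # no misclassified point is left
            return b, w
    raise RuntimeError("PLA did not converge")


def margin(b, w, X):
    """Distance from the hyperplane (b, w) to the nearest data point."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w for x in X)


X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
svm_margin = 1 / math.sqrt(2)            # margin of the optimal hyperplane in Example 8.2

for order in ([0, 1, 2, 3], [3, 2, 1, 0]):
    b, w = pla(X, y, order)
    print(order, (b, w), round(margin(b, w, X) / svm_margin, 3))
```

The two orderings end at different separators, and neither uses the full available margin, whereas the maximum-margin solution is the same for every ordering.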
[Figure 8.1 panels: (a) Data and SVM separator; (b) Histogram of Eout (PLA), with Eout on the horizontal axis (tick marks at 0.02, 0.04, 0.06, 0.08) and Eout(SVM) marked.]

Figure 8.1: (a) The SVM classifier from data generated using $f(\mathbf{x}) =
\operatorname{sign}(x_2)$ (the blue region is $f(\mathbf{x}) = +1$). The margin (cushion) is inside
the gray lines and the three support vectors are enclosed in boxes. (b) A
histogram of Eout for PLA classifiers using random orderings of the data.


Exercise 8.6 
Construct a toy data set with N = 20 using the method in Example 8.4. 


(a) Run the SVM algorithm to obtain the maximum margin separator
$(b, \mathbf{w})_{\text{SVM}}$ and compute its Eout and margin.

(b) Construct an ordering of the data points that results in a hyperplane 
with bad Eout when PLA is run on it. [Hint: Identify a positive and 
negative data point for which the perpendicular bisector separating 
these two points is a bad separator. Where should these two points 
be in the ordering? How many iterations will PLA take?] 


(c) Create a plot of your two separators arising from SVM and PLA. 


8.1.2 Is a Fat Separator Better? 


The SVM (optimal hyperplane) is a linear model, so it cannot fit certain simple 
functions as discussed in Chapter 3. But, on the positive side, the SVM also 
inherits the good generalization capability of the simple linear model since the 
VC dimension is bounded by d+ 1. Does the support vector machine gain 
any more generalization ability by maximizing the margin, providing a safety 
cushion so to speak? Now that we have the algorithm that gets us the optimal 
hyperplane, we are in a position to shed some light on this question. 


We already argued intuitively that the optimal hyperplane is robust to 
noise. We now show this link to regularization more explicitly. We have seen 
an optimization problem similar to the one in (8.4) before. Recall the soft-order
constraint when we discussed regularization in Chapter 4,

$$\begin{aligned}
\underset{\mathbf{w}}{\text{minimize:}} \quad & E_{\text{in}}(\mathbf{w}) \\
\text{subject to:} \quad & \mathbf{w}^T\mathbf{w} \le C.
\end{aligned}$$


Ultimately, this led to weight-decay regularization. In regularization, we minimize
Ein given a budget C for $\mathbf{w}^T\mathbf{w}$. For the optimal hyperplane (SVM), we
minimize $\mathbf{w}^T\mathbf{w}$ subject to a budget on Ein (namely Ein = 0). In a sense, the
optimal hyperplane is doing automatic weight-decay regularization.


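$$\begin{array}{l|c|c}
 & \text{optimal hyperplane} & \text{regularization} \\ \hline
\text{minimize:} & \mathbf{w}^T\mathbf{w} & E_{\text{in}} \\
\text{subject to:} & E_{\text{in}} = 0 & \mathbf{w}^T\mathbf{w} \le C
\end{array}$$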


Both methods are trying to fit the data with small weights. That is the gist 
of regularization. This link to regularization is interesting, but does it help? 
Yes, both in theory and practice. We hope to convince you of three things. 


(i) A larger-margin separator yields better performance in practice. We will 
illustrate this with a simple empirical experiment. 


(ii) Fat hyperplanes generalize better than thin hyperplanes. We will show 
this by bounding the number of points that fat hyperplanes can shatter. 
Our bound is less than d+ 1 for fat enough hyperplanes. 


(iii) The out-of-sample error can be small, even if the dimension d is large. To
show this, we bound the cross validation error Ecv (recall that Ecv is a
proxy for Eout). Our bound does not explicitly depend on the dimension.


(i) Larger Margin is Better. We will use the toy learning problem in 
Example 8.4 to study how separators with different margins perform. We ran- 
domly generate a separable data set of size N = 20. There are infinitely many 
separating hyperplanes. We sample these separating hyperplanes randomly. 

For each random separating hyperplane h, we can compute Eout(h) and the
margin $\rho(h)/\rho(\text{SVM})$ (normalized by the maximum possible margin achieved
by the SVM). We can now plot how Eout(h) depends on the fraction of the
available margin that h uses. For each data set, we sample 50,000 random
separating hyperplanes and then average the results over several thousands of
data sets. Figure 8.2 shows the dependence of Eout on the margin.

Figure 8.2 clearly suggests that when choosing a random separating hyper- 
plane, it is better to choose one with larger margin. The SVM, which picks 
the hyperplane with largest margin, is doing a much better job than one of the 
(typical) random separating hyperplanes. Notice that once you get to large
enough margin, increasing the margin further can hurt Eout: it is possible to
slightly improve the performance among random hyperplanes by slightly
sacrificing on the maximum margin and perhaps improving in other ways. Let’s
now build some theoretical evidence for why large margin separation is good. 
[Figure 8.2 plot: Eout (vertical axis) versus $\rho(\text{random hyperplane})/\rho(\text{SVM})$ (horizontal axis), with the curve labeled ‘random hyperplane’.]

Figure 8.2: Dependence of Eout versus the margin $\rho$ for a random separating
hyperplane. The shaded histogram in the background shows the relative
frequencies of a particular margin when selecting a random hyperplane.


(ii) Fat Hyperplanes Shatter Fewer Points. The VC dimension of a hypothesis
set is the maximum number of points that can be shattered. A smaller
VC dimension gives a smaller generalization error bar that links Ein to Eout
(see Chapter 2 for details). Consider the hypothesis set $\mathcal{H}_\rho$ containing all
fat hyperplanes of width (margin) at least $\rho$. A dichotomy on a data set
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ can be implemented by a hypothesis $h \in \mathcal{H}_\rho$ if $y_n =
h(\mathbf{x}_n)$ and none of the $\mathbf{x}_n$ lie inside the margin of h. Let $d_{\text{vc}}(\rho)$ be the maximum
number of points that $\mathcal{H}_\rho$ can shatter.³ Our goal is to show that restricting
the hypothesis set to fat separators can decrease the number of points that
can be shattered, that is, $d_{\text{vc}}(\rho) < d + 1$. Here is an example.





































Figure 8.3: Thin hyperplanes can implement all 8 dichotomies (the other 
4 dichotomies are obtained by negating the weights and bias). 


Thin (zero thickness) separators can shatter the three points shown above. As
we increase the thickness of the separator, we will soon not be able to implement
the rightmost dichotomy. Eventually, as we increase the thickness even
more, the only dichotomies we can implement will be the constant dichotomies.


³ Technically a hypothesis in $\mathcal{H}_\rho$ is not a classifier because inside its margin the output is
not defined. Nevertheless we can still compute the maximum number of points that can be
shattered and this ‘VC dimension’ plays a similar role to $d_{\text{vc}}$ in establishing a generalization
error bar for the model, albeit using more sophisticated analysis.


Figure 8.4: Only 4 of the 8 dichotomies can be separated by hyperplanes
with thickness $\rho$. The dashed lines show the thickest possible separator for
each non-separable dichotomy.








In Figure 8.4, there is no $\rho$-thick hyperplane that can separate the right two
dichotomies. This example illustrates why thick hyperplanes implement fewer 
of the possible dichotomies. Note that the hyperplanes only ‘look’ thick be- 
cause the data are close together. If we moved the data further apart, then 
even a thick hyperplane can implement all the dichotomies. What matters is 
the thickness of the hyperplane relative to the spacing of the data. 


Exercise 8.7 
Assume that the data is restricted to lie in a unit sphere.
(a) Show that $d_{\text{vc}}(\rho)$ is non-increasing in $\rho$.
(b) In 2 dimensions, show that $d_{\text{vc}}(\rho) < 3$ for $\rho > \frac{\sqrt{3}}{2}$. [Hint: Show that
for any 3 points in the unit disc, there must be two that are within
distance $\sqrt{3}$ of each other. Use this fact to construct a dichotomy
that cannot be implemented by any $\rho$-thick separator.]


The exercise shows that for a bounded input space, thick separators can shatter 
fewer than d+ 1 points. In general, we can prove the following result. 


Theorem 8.5 (VC dimension of Fat Hyperplanes). Suppose the input space
is the ball of radius R in $\mathbb{R}^d$, so $\|\mathbf{x}\| \le R$. Then,

$$d_{\text{vc}}(\rho) \le \lceil R^2/\rho^2 \rceil + 1,$$

where $\lceil R^2/\rho^2 \rceil$ is the smallest integer greater than or equal to $R^2/\rho^2$.


We also know that the VC dimension is at most d + 1, so we can pick the
better of the two bounds, and we gain when $R/\rho$ is small. The most important
fact about this margin-based bound is that it does not explicitly depend on
the dimension d. This means that if we transform the data to a high, even
infinite, dimensional space, as long as we use fat enough separators, we obtain
good generalization. Since this result establishes a crucial link between the
margin and good generalization, we give a simple, though somewhat technical,
proof.


Begin safe skip: The proof is cute but not essential. 
A similar green box will tell you when to rejoin. 


Proof. Fix $\mathbf{x}_1,\ldots,\mathbf{x}_N$ that are shattered by hyperplanes with margin $\rho$. We 
will show that when $N$ is even, $N \le R^2/\rho^2 + 1$. When $N$ is odd, a similar but 
more careful analysis (see Problem 8.8) shows that $N \le R^2/\rho^2 + 1 + \frac{1}{N}$. In 
both cases, $N \le \lceil R^2/\rho^2 \rceil + 1$. 

Assume $N$ is even. We need the following geometric fact (which is proved 
in Problem 8.7). There exists a (balanced) dichotomy $y_1,\ldots,y_N$ such that 

$$\sum_{n=1}^{N} y_n = 0, \quad\text{and}\quad \left\| \sum_{n=1}^{N} y_n \mathbf{x}_n \right\| \le \frac{NR}{\sqrt{N-1}}. \tag{8.7}$$

(For random $y_n$, $\sum_n y_n \mathbf{x}_n$ is a random walk whose distance from the origin 
doesn’t grow faster than $R\sqrt{N}$.) The dichotomy satisfying (8.7) is separated 
with margin at least $\rho$ (since $\mathbf{x}_1,\ldots,\mathbf{x}_N$ is shattered). So, for some $(\mathbf{w}, b)$, 

$$\rho\|\mathbf{w}\| \le y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b), \quad\text{for } n = 1,\ldots,N.$$


Summing over $n$, using $\sum_n y_n = 0$ and the Cauchy-Schwarz inequality, 

$$N\rho\|\mathbf{w}\| \le \mathbf{w}^{\mathrm{T}}\sum_{n=1}^{N} y_n\mathbf{x}_n + b\sum_{n=1}^{N} y_n = \mathbf{w}^{\mathrm{T}}\sum_{n=1}^{N} y_n\mathbf{x}_n \le \|\mathbf{w}\| \left\| \sum_{n=1}^{N} y_n\mathbf{x}_n \right\|.$$

By the bound in (8.7), the RHS is at most $\|\mathbf{w}\| NR/\sqrt{N-1}$, or: 

$$\rho \le \frac{R}{\sqrt{N-1}} \implies N \le \frac{R^2}{\rho^2} + 1.$$

End safe skip: Welcome back for a summary. 


Combining Theorem 8.5 with $d_{\mathrm{vc}}(\rho) \le d_{\mathrm{vc}}(0) = d + 1$, we have that 

$$d_{\mathrm{vc}}(\rho) \le \min\left(\lceil R^2/\rho^2 \rceil, d\right) + 1.$$

The bound suggests that $\rho$ can be used to control model complexity. The 
separating hyperplane ($E_{\mathrm{in}} = 0$) with the maximum margin $\rho$ will have the 
smallest $d_{\mathrm{vc}}(\rho)$ and hence smallest generalization error bar. 

Unfortunately, there is a complication. The correct width $\rho$ to use is not 
known to us ahead of time (what if we chose a higher $\rho$ than possible?). 
The optimal hyperplane algorithm fixes $\rho$ only after seeing the data. Giving 
yourself the option to use a smaller $\rho$ but settling on a higher $\rho$ when you 
see the data means the data is being snooped. One needs to modify the VC 
analysis to take into account this kind of data snooping. The interested reader 
can find the details related to these technicalities in the literature. 


(iii) Bounding the Cross Validation Error. In Chapter 4, we introduced 
the leave-one-out cross validation error $E_{\mathrm{cv}}$, which is the average of the 
leave-one-out errors $e_n$. Each $e_n$ is computed by leaving out the data point 
$(\mathbf{x}_n, y_n)$ and learning on the remaining data to obtain $g_n$. The hypothesis $g_n$ 
is evaluated on $(\mathbf{x}_n, y_n)$ to give $e_n$, and 

$$E_{\mathrm{cv}} = \frac{1}{N}\sum_{n=1}^{N} e_n.$$

An important property of $E_{\mathrm{cv}}$ is that it is an unbiased estimate of the expected 
out-of-sample error for a data set of size $N - 1$ (see Chapter 4 for details 
on validation). Hence, $E_{\mathrm{cv}}$ is a very useful proxy for $E_{\mathrm{out}}$. We can get a 
surprisingly simple bound for $E_{\mathrm{cv}}$ using the number of support vectors. Recall 
that a support vector lies on the very edge of the margin of the optimal 
hyperplane. We illustrate a toy data set in Figure 8.5(a) with the support 
vectors highlighted in boxes. 


























Figure 8.5: (a) All data: optimal hyperplane with support vectors enclosed in 
boxes. (b) Only support vectors: the same hyperplane is obtained using only 
the support vectors. 


The crucial observation as illustrated in Figure 8.5(b) is that if any (or all) of the 
data points other than the support vectors are removed, the resulting separator 
produced by the SVM is unchanged. You are asked to give a formal proof of 
this in Problem 8.9, but Figure 8.5 should serve as ample justification. This 
observation has an important implication: $e_n = 0$ for any data point $(\mathbf{x}_n, y_n)$ 
that is not a support vector. This is because removing that data point results in 
the same separator, and since $(\mathbf{x}_n, y_n)$ was classified correctly before removal, 
it will remain correct after removal. For the support vectors, $e_n \le 1$ (binary 
classification error), and so 

$$E_{\mathrm{cv}}(\mathrm{SVM}) = \frac{1}{N}\sum_{n=1}^{N} e_n \le \frac{\#\text{ support vectors}}{N}. \tag{8.8}$$


Exercise 8.8 
(a) Evaluate the bound in (8.8) for the data in Figure 8.5. 


(b) If one of the four support vectors in a gray box is removed, does the 
classifier change? 


(c) Use your answer in (b) to improve your bound in (a). 


The support vectors in gray boxes are non-essential and the support vector 
in the black box is essential. One can improve the bound in (8.8) to use only 
essential support vectors. The number of support vectors is unbounded, but 
the number of essential support vectors is at most $d+1$ (usually much less). 


In the interest of full disclosure, and out of fairness to the PLA, we should 
note that a bound on $E_{\mathrm{cv}}$ can be obtained for PLA as well, namely 

$$E_{\mathrm{cv}}(\mathrm{PLA}) \le \frac{R^2}{N\rho^2},$$

where $\rho$ is the margin of the thickest hyperplane that separates the data, and $R$ 
is an upper bound on $\|\mathbf{x}_n\|$ (see Problem 8.11). The table below provides a 
summary of what we know based on our discussion so far. 


Algorithm For Selecting Separating Hyperplane 

General                  PLA                                   SVM (Optimal Hyperplane) 
$d_{\mathrm{vc}} = d + 1$                                      $d_{\mathrm{vc}}(\rho) \le \min(\lceil R^2/\rho^2 \rceil, d) + 1$ 
                         $E_{\mathrm{cv}} \le R^2/(N\rho^2)$   $E_{\mathrm{cv}} \le (\#\text{ support vectors})/N$ 





In general, all you can conclude is the VC bound based on a VC dimension of 
$d+1$. In high dimensions, this bound can be very loose. For PLA or SVM, 
we have additional bounds that do not explicitly depend on the dimension $d$. 
If the margin is large, or if the number of support vectors is small (even in 
infinite dimensions), we are still in good shape. 
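
The bounds summarized above are simple enough to evaluate directly. The following is a minimal sketch, not part of the original text; the function names are illustrative, and the numbers in the usage lines are made-up inputs chosen only to show how the margin-based bound can be far smaller than $d+1$ in high dimensions.

```python
import math


def fat_hyperplane_vc_bound(R, rho, d):
    """Bound on d_vc(rho): min(ceil(R^2 / rho^2), d) + 1."""
    return min(math.ceil(R ** 2 / rho ** 2), d) + 1


def pla_cv_bound(R, rho, N):
    """Bound on E_cv for PLA: R^2 / (N * rho^2)."""
    return R ** 2 / (N * rho ** 2)


def svm_cv_bound(num_support_vectors, N):
    """Bound on E_cv for the optimal hyperplane: (# support vectors) / N."""
    return num_support_vectors / N


# Hypothetical inputs: data in a ball of radius 1, margin 0.5, d = 1000.
margin_based = fat_hyperplane_vc_bound(R=1.0, rho=0.5, d=1000)
dimension_based = 1000 + 1
print(margin_based, dimension_based)
print(pla_cv_bound(R=1.0, rho=0.5, N=100))
print(svm_cv_bound(num_support_vectors=3, N=50))
```

When the margin is small relative to $R$, the `min` falls back to the dimension-based value $d+1$, which matches the remark that we gain only when $R/\rho$ is small.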


8.1.3 Non-Separable Data 


Our entire discussion so far assumed that the data is linearly separable and 
focused on separating the data with maximum safety cushion. What if the 
data is not linearly separable? Figure 8.6 (reproduced from Chapter 3) 
illustrates the two types of non-separability. In Figure 8.6(a), two noisy data 
points render the data non-separable. In Figure 8.6(b), the target function is 
inherently nonlinear. 

Figure 8.6: Non-separable data (reproduction of Figure 3.1). (a) Few noisy 
data. (b) Nonlinearly separable. 

For the learning problem in Figure 8.6(a), we prefer the linear separator, 
and need to tolerate the few noisy data points. In Chapter 3, we modified the 
PLA into the pocket algorithm to handle this situation. Similarly, for SVMs, 
we will modify the hard-margin SVM to the soft-margin SVM in Section 8.4. 
Unlike the hard-margin SVM, the soft-margin SVM allows data points to 
violate the cushion, or even be misclassified. 

To address the other situation in Figure 8.6(b), we introduced the nonlinear 
transform in Chapter 3. There is nothing to stop us from using the nonlinear 
transform with the optimal hyperplane, which we will do here. 

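To render the data separable, we would typically transform into a higher 
dimension. Consider a transform $\Phi: \mathbb{R}^d \to \mathbb{R}^{\tilde{d}}$. The transformed data are 

$$\mathbf{z}_n = \Phi(\mathbf{x}_n).$$

After transforming the data, we solve the hard-margin SVM problem in the 
$\mathcal{Z}$ space, which is just (8.4) written with $\mathbf{z}_n$ instead of $\mathbf{x}_n$: 

$$\begin{aligned}
\underset{\tilde{b},\,\tilde{\mathbf{w}}}{\text{minimize:}}\quad & \tfrac{1}{2}\tilde{\mathbf{w}}^{\mathrm{T}}\tilde{\mathbf{w}} \\
\text{subject to:}\quad & y_n\left(\tilde{\mathbf{w}}^{\mathrm{T}}\mathbf{z}_n + \tilde{b}\right) \ge 1 \quad (n = 1,\cdots,N),
\end{aligned} \tag{8.9}$$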


where $\tilde{\mathbf{w}}$ is now in $\mathbb{R}^{\tilde{d}}$ instead of $\mathbb{R}^d$ (recall that we use tilde for objects in 
$\mathcal{Z}$ space). The optimization problem in (8.9) is a QP-problem with $\tilde{d}+1$ 
optimization variables and $N$ constraints. To solve this optimization problem, 
we can use the standard hard-margin algorithm in the algorithm box on page 8-10, 
after we replace $\mathbf{x}_n$ with $\mathbf{z}_n$ and $d$ with $\tilde{d}$. The algorithm returns an optimal 
solution $\tilde{b}^*, \tilde{\mathbf{w}}^*$ and the final hypothesis is 

$$g(\mathbf{x}) = \operatorname{sign}\left(\tilde{\mathbf{w}}^{*\mathrm{T}}\Phi(\mathbf{x}) + \tilde{b}^*\right).$$

Figure 8.7: Nonlinear separation using the SVM with 2nd and 3rd order 
polynomial transforms: (a) SVM with $\Phi_2$; (b) SVM with $\Phi_3$. The margin is 
shaded in yellow, and the support vectors are boxed. The dimension of $\Phi_3$ is 
nearly double that of $\Phi_2$, yet the resulting SVM separator is not severely 
overfitting with $\Phi_3$. 


In Chapter 3, we introduced a general $k$th-order polynomial transform $\Phi_k$. 
For example, the second order polynomial transform is 

$$\Phi_2(\mathbf{x}) = (x_1, x_2, x_1^2, x_1x_2, x_2^2).$$
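
As a small illustration of the transform step, here is a sketch of the second order polynomial transform for two-dimensional inputs. The function name `phi2` and the sample points are assumptions made for this example only; the transformed points are what would be handed to the hard-margin algorithm in place of the original inputs.

```python
def phi2(x):
    """Second order polynomial transform of a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1, x2, x1 * x1, x1 * x2, x2 * x2)


# Hypothetical inputs: transform every point before running the SVM in Z space.
points = [(2, 3), (-1, 0.5)]
transformed = [phi2(x) for x in points]
print(transformed[0])  # (2, 3, 4, 6, 9)
print(len(transformed[0]))  # dimension of the Z space: 5
```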


Figure 8.7 shows the result of using the SVM with the 2nd and 3rd order 
polynomial transforms $\Phi_2$, $\Phi_3$ for the data in Figure 8.6(b). Observe that the 
‘margin’ does not have constant width in the $\mathcal{X}$ space; it is in the $\mathcal{Z}$ space that 
the width of the separator is uniform. Also, in Figure 8.7(b) you can see that 
some blue points near the top left are not support vectors, even though they 
appear closer to the separating surface than the red support vectors near the 
bottom right. Again this is because it is distances to the separator in the $\mathcal{Z}$ 
space that matter. Lastly, the dimensions of the two feature spaces are $\tilde{d}_2 = 5$ 
and $\tilde{d}_3 = 9$ (almost double for the 3rd order polynomial transform). However, 
the number of support vectors increased from 5 to only 6, and so the bound 
on $E_{\mathrm{cv}}$ did not nearly double. 

The benefit of the nonlinear transform is that we can use a sophisticated 
boundary to lower $E_{\mathrm{in}}$. However, you pay a price to get this sophisticated 
boundary in terms of a larger $d_{\mathrm{vc}}$ and a tendency to overfit. This trade-off is 
highlighted in the table below for PLA, to compare with SVM. 

                      perceptrons               perceptron + nonlinear transform 
complexity control    small $d_{\mathrm{vc}}$   large $d_{\mathrm{vc}}$ 
boundary              linear                    sophisticated 


We observe from Figure 8.7(b) that the 3rd order polynomial SVM does not 
show the level of overfitting we might expect when we nearly double the number 
of free parameters. We can have our cake and eat it too: we enjoy the 
benefit of high-dimensional transforms in terms of getting sophisticated 
boundaries, and yet we don’t pay too much in terms of $E_{\mathrm{out}}$ because $d_{\mathrm{vc}}$ and $E_{\mathrm{cv}}$ 
can be controlled in terms of quantities not directly linked to $\tilde{d}$. The table 
illustrating the trade-offs for the SVM is: 

                      SVM                                     SVM + nonlinear transform 
complexity control    smaller ‘effective’ $d_{\mathrm{vc}}$   not too large ‘effective’ $d_{\mathrm{vc}}$ 
boundary              linear                                  sophisticated 


You now have the support vector machine at your fingertips: a very powerful, 
easy to use linear model which comes with automatic regularization. We could 
be done, but instead, we are going to introduce you to a very powerful tool 
that can be used to implement nonlinear transforms called a kernel. The kernel 
will allow you to fully exploit the capabilities of the SVM. The SVM has a 
potential robustness to overfitting even after transforming to a much higher 
dimension that opens up a new world of possibilities: what about using infinite 
dimensional transforms? Yes, infinite! It is certainly not feasible within our 
current technology to use an infinite dimensional transform; but, by using the 
‘kernel trick’, not only can we make infinite dimensional transforms feasible, 
we can also make them efficient. Stay tuned; it’s exciting stuff. 


8.2 Dual Formulation of the SVM 


The promise of infinite dimensional nonlinear transforms certainly whets the 
appetite, but we are going to have to earn our cookie ☹. We are going to 
introduce the dual SVM problem which is equivalent to the original primal 
problem (8.9) in that solving both problems gives the same optimal 
hyperplane, but the dual problem is often easier. It is this dual (no pun intended) 
view of the SVM that will enable us to exploit the kernel trick that we have 
touted so loudly. But, the dual view also stands alone as an important 
formulation of the SVM that will give us further insights into the optimal hyperplane. 
The connection between primal and dual optimization problems is a heavily 
studied subject in optimization theory, and we only introduce those few pieces 
that we need for the SVM. 

Note that (8.9) is a QP-problem with $\tilde{d}+1$ variables $(\tilde{b}, \tilde{\mathbf{w}})$ and $N$ 
constraints. It is computationally difficult to solve (8.9) when $\tilde{d}$ is large, let alone 
infinite. The dual problem will also be a QP-problem, but with $N$ variables 
and $N + 1$ constraints; the computational burden no longer depends on $\tilde{d}$, 
which is a big saving when we move to very high $\tilde{d}$. 

Deriving the dual is not going to be easy, but on the positive side, we 
have already seen the main tool that we will need (in Section 4.2 when we 
talked about regularization). Regularization introduced us to the constrained 
minimization problem 


$$\begin{aligned}
\underset{\mathbf{w}}{\text{minimize:}}\quad & E_{\mathrm{in}}(\mathbf{w}) \\
\text{subject to:}\quad & \mathbf{w}^{\mathrm{T}}\mathbf{w} \le C,
\end{aligned}$$

where $C$ is a user-specified parameter. We showed that for a given $C$, there 
is a $\lambda$ such that minimizing the augmented error $E_{\mathrm{aug}}(\mathbf{w}) = E_{\mathrm{in}}(\mathbf{w}) + \lambda\mathbf{w}^{\mathrm{T}}\mathbf{w}$ 
gives the same regularized solution. We viewed the Lagrange multiplier $\lambda$ as 
the user-specified parameter instead of $C$, and minimized the augmented error 
instead of solving the constrained minimization problem. Here, we are also 
going to use Lagrange multipliers, but in a slightly different way. The Lagrange 
multipliers will arise as variables that correspond to the constraints, and we 
need to formulate a new optimization problem (which is the dual problem) to 
solve for those variables. 


8.2.1 Lagrange Dual for a QP-Problem 


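Let us illustrate the concept of the dual with a simplified version of the 
standard QP-problem, using just one inequality constraint. All the concepts will 
easily generalize to the case with more than one constraint. So, consider the 
QP-problem 

$$\begin{aligned}
\underset{\mathbf{u}\in\mathbb{R}^L}{\text{minimize:}}\quad & \tfrac{1}{2}\mathbf{u}^{\mathrm{T}}Q\mathbf{u} + \mathbf{p}^{\mathrm{T}}\mathbf{u} \\
\text{subject to:}\quad & \mathbf{a}^{\mathrm{T}}\mathbf{u} \ge c
\end{aligned} \tag{8.10}$$

Here is a closely related optimization problem. 

$$\underset{\mathbf{u}\in\mathbb{R}^L}{\text{minimize:}}\quad \tfrac{1}{2}\mathbf{u}^{\mathrm{T}}Q\mathbf{u} + \mathbf{p}^{\mathrm{T}}\mathbf{u} + \max_{\alpha\ge 0}\,\alpha\,(c - \mathbf{a}^{\mathrm{T}}\mathbf{u}). \tag{8.11}$$

The variable $\alpha \ge 0$ multiplies the constraint $(c - \mathbf{a}^{\mathrm{T}}\mathbf{u})$.¹ To obtain the objective 
in (8.11) from (8.10), we add what amounts to a penalty term that encourages 
$(c - \mathbf{a}^{\mathrm{T}}\mathbf{u})$ to be negative and satisfy the constraint: the maximum over $\alpha \ge 0$ 
is zero when $c - \mathbf{a}^{\mathrm{T}}\mathbf{u} \le 0$ and infinite otherwise. The optimization problem 
in (8.10) is equivalent to the one in (8.11) as long as there is at least one 
solution that satisfies the constraint in (8.10). The advantage in (8.11) is that 
the minimization with respect to $\mathbf{u}$ is unconstrained, and the price is a slightly 
more complex objective that involves this ‘Lagrangian penalty term’. The 
next exercise proves the equivalence. 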
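
To make the behavior of the penalty term concrete, the sketch below evaluates the objective of (8.11) for a single constraint. It is an illustration only: the function names and the small problem instance ($Q = 2I$, $\mathbf{p} = \mathbf{0}$, $\mathbf{a} = (1, 2)$, $c = 2$) are assumptions chosen for the example, and the inner maximization over $\alpha \ge 0$ is written in closed form rather than computed numerically.

```python
import math


def lagrangian_penalty(u, a, c):
    """max over alpha >= 0 of alpha * (c - a.u): 0 if feasible, +inf if not."""
    violation = c - sum(ai * ui for ai, ui in zip(a, u))
    return 0.0 if violation <= 0 else math.inf


def penalized_objective(u, Q, p, a, c):
    """Objective of (8.11): 0.5 u'Qu + p'u + max_{alpha>=0} alpha (c - a'u)."""
    n = len(u)
    quad = 0.5 * sum(u[i] * Q[i][j] * u[j] for i in range(n) for j in range(n))
    lin = sum(pi * ui for pi, ui in zip(p, u))
    return quad + lin + lagrangian_penalty(u, a, c)


# Hypothetical instance: minimize u1^2 + u2^2 subject to u1 + 2*u2 >= 2.
Q = [[2, 0], [0, 2]]
p = [0, 0]
a = [1, 2]
c = 2

print(penalized_objective([1, 1], Q, p, a, c))  # feasible point: plain objective
print(penalized_objective([0, 0], Q, p, a, c))  # infeasible point: inf
```

Because every infeasible point receives an infinite objective, an unconstrained minimizer of (8.11) can never sit outside the feasible set of (8.10).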

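¹ The parameter $\alpha$ is called a Lagrange multiplier. In the optimization literature, the 
Lagrange multiplier would typically be denoted by $\lambda$. Historically, the SVM literature has 
used $\alpha$, which is the convention we follow. 

Exercise 8.9 
Let $\mathbf{u}_0$ be optimal for (8.10), and let $\mathbf{u}_1$ be optimal for (8.11). 

(a) Show that $\max_{\alpha\ge 0}\,\alpha\,(c - \mathbf{a}^{\mathrm{T}}\mathbf{u}_0) = 0$. [Hint: $c - \mathbf{a}^{\mathrm{T}}\mathbf{u}_0 \le 0$.] 

(b) Show that $\mathbf{u}_1$ is feasible for (8.10). To show this, suppose to the 
contrary that $c - \mathbf{a}^{\mathrm{T}}\mathbf{u}_1 > 0$. Show that the objective in (8.11) is 
infinite, whereas $\mathbf{u}_0$ attains a finite objective of $\tfrac{1}{2}\mathbf{u}_0^{\mathrm{T}}Q\mathbf{u}_0 + \mathbf{p}^{\mathrm{T}}\mathbf{u}_0$, 
which contradicts the optimality of $\mathbf{u}_1$. 

(c) Show that $\tfrac{1}{2}\mathbf{u}_1^{\mathrm{T}}Q\mathbf{u}_1 + \mathbf{p}^{\mathrm{T}}\mathbf{u}_1 = \tfrac{1}{2}\mathbf{u}_0^{\mathrm{T}}Q\mathbf{u}_0 + \mathbf{p}^{\mathrm{T}}\mathbf{u}_0$, and hence that $\mathbf{u}_1$ 
is optimal for (8.10) and $\mathbf{u}_0$ is optimal for (8.11). 

(d) Let $\mathbf{u}^*$ be any optimal solution for (8.11) with $\max_{\alpha\ge 0}\,\alpha\,(c - \mathbf{a}^{\mathrm{T}}\mathbf{u}^*)$ 
attained at $\alpha^*$. Show that 

$$\alpha^*\,(c - \mathbf{a}^{\mathrm{T}}\mathbf{u}^*) = 0. \tag{8.12}$$

Either the constraint is exactly satisfied with $c - \mathbf{a}^{\mathrm{T}}\mathbf{u}^* = 0$, or $\alpha^* = 0$. 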


Exercise 8.9 shows that as long as the original problem in (8.10) is feasible, 
we can obtain a solution by solving (8.11) instead. Let us analyze (8.11) in 
further detail. Our goal is to simplify it. Introduce the Lagrangian function 
$\mathcal{L}(\mathbf{u}, \alpha)$, defined by 

$$\mathcal{L}(\mathbf{u}, \alpha) = \tfrac{1}{2}\mathbf{u}^{\mathrm{T}}Q\mathbf{u} + \mathbf{p}^{\mathrm{T}}\mathbf{u} + \alpha\,(c - \mathbf{a}^{\mathrm{T}}\mathbf{u}). \tag{8.13}$$

In terms of $\mathcal{L}$, the optimization problem in (8.11) is 

$$\min_{\mathbf{u}}\,\max_{\alpha\ge 0}\,\mathcal{L}(\mathbf{u}, \alpha). \tag{8.14}$$

For convex quadratic programming, when $\mathcal{L}(\mathbf{u}, \alpha)$ has the special form in (8.13) 
and $c - \mathbf{a}^{\mathrm{T}}\mathbf{u} \le 0$ is feasible, a profound relationship known as strong duality 
has been proven to hold: 

$$\min_{\mathbf{u}}\,\max_{\alpha\ge 0}\,\mathcal{L}(\mathbf{u}, \alpha) = \max_{\alpha\ge 0}\,\min_{\mathbf{u}}\,\mathcal{L}(\mathbf{u}, \alpha). \tag{8.15}$$

An interested reader can consult a standard reference in convex optimization 
for a proof. The impact for us is that a solution to the optimization problem 
on the RHS of (8.15) gives a solution to the problem on the LHS, which 
is the problem we want to solve. This helps us because on the RHS, one 
is first minimizing with respect to an unconstrained $\mathbf{u}$, and that we can do 
analytically. This analytical step considerably reduces the complexity of the 
problem. Solving the problem on the RHS of (8.15) is known as solving the 
Lagrange dual problem (dual problem for short). The original problem is called 
the primal problem. 

We briefly discuss what to do when there are many constraints, $\mathbf{a}_m^{\mathrm{T}}\mathbf{u} \ge c_m$ 
for $m = 1,\ldots,M$. Not much changes. All we do is introduce a Lagrange 
multiplier $\alpha_m \ge 0$ for each constraint and add the penalty term $\alpha_m(c_m - \mathbf{a}_m^{\mathrm{T}}\mathbf{u})$ 
into the Lagrangian. Then, just as before, we solve the dual problem. A simple 
example will illustrate all the mechanics. 


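Example 8.6. Let’s minimize $u_1^2 + u_2^2$ subject to $u_1 + 2u_2 \ge 2$ and $u_1, u_2 \ge 0$. 
We first construct the Lagrangian, 

$$\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha}) = u_1^2 + u_2^2 + \alpha_1(2 - u_1 - 2u_2) - \alpha_2u_1 - \alpha_3u_2,$$

where, as you can see, we have added a penalty term for each of the three 
constraints, and each penalty term has an associated Lagrange multiplier. To 
solve the dual optimization problem, we first need to minimize $\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha})$ with 
respect to the unconstrained $\mathbf{u}$. We then need to maximize with respect to 
$\boldsymbol{\alpha} \ge \mathbf{0}$. To do the unconstrained minimization with respect to $\mathbf{u}$, we use the 
standard first derivative condition from calculus: 

$$\frac{\partial\mathcal{L}}{\partial u_1} = 0 \implies u_1 = \frac{\alpha_1 + \alpha_2}{2}; \qquad \frac{\partial\mathcal{L}}{\partial u_2} = 0 \implies u_2 = \frac{2\alpha_1 + \alpha_3}{2}. \tag{$*$}$$

Plugging these values for $u_1$ and $u_2$ back into $\mathcal{L}$ and collecting terms, we have 
a function only of $\boldsymbol{\alpha}$, which we have to maximize with respect to $\alpha_1, \alpha_2, \alpha_3$: 

$$\begin{aligned}
\text{maximize:}\quad & \mathcal{L}(\boldsymbol{\alpha}) = -\tfrac{5}{4}\alpha_1^2 - \tfrac{1}{4}\alpha_2^2 - \tfrac{1}{4}\alpha_3^2 - \tfrac{1}{2}\alpha_1\alpha_2 - \alpha_1\alpha_3 + 2\alpha_1 \\
\text{subject to:}\quad & \alpha_1, \alpha_2, \alpha_3 \ge 0
\end{aligned}$$


Exercise 8.10 

Do the algebra. Derive ($*$) and plug it into $\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha})$ to obtain $\mathcal{L}(\boldsymbol{\alpha})$. 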


By going to the dual, all we accomplished is to obtain another QP-problem! 
So, in a sense, we have not solved the problem at all. What did we gain? The 
new problem is easier to solve. This is because the constraints for the dual 
problem are simple ($\boldsymbol{\alpha} \ge \mathbf{0}$). In our case, all terms involving $\alpha_2$ and $\alpha_3$ are 
at most zero, and we can attain zero by setting $\alpha_2 = \alpha_3 = 0$. This leaves us 
with $-\tfrac{5}{4}\alpha_1^2 + 2\alpha_1$, which is maximized when $\alpha_1 = \tfrac{4}{5}$. So the final solution is 
$\alpha_1 = \tfrac{4}{5}$, $\alpha_2 = \alpha_3 = 0$, and plugging these into ($*$) gives 

$$u_1 = \tfrac{2}{5} \quad\text{and}\quad u_2 = \tfrac{4}{5}.$$

The optimal value for the objective is $u_1^2 + u_2^2 = \tfrac{4}{5}$. 
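
The arithmetic in Example 8.6 can be checked exactly with rational numbers. The sketch below is an illustration, not part of the original example; the function names are hypothetical. It implements the stationarity solution ($*$), plugs it into the Lagrangian to obtain the dual function, and confirms that the primal and dual optimal values coincide, as strong duality promises.

```python
from fractions import Fraction as F


def u_of_alpha(a1, a2, a3):
    """Unconstrained minimizer of the Lagrangian over u, from (*)."""
    return (a1 + a2) / 2, (2 * a1 + a3) / 2


def lagrangian(u1, u2, a1, a2, a3):
    return u1 ** 2 + u2 ** 2 + a1 * (2 - u1 - 2 * u2) - a2 * u1 - a3 * u2


def dual(a1, a2, a3):
    """Dual function L(alpha) = min over u of the Lagrangian."""
    u1, u2 = u_of_alpha(a1, a2, a3)
    return lagrangian(u1, u2, a1, a2, a3)


def dual_closed_form(a1, a2, a3):
    return (-F(5, 4) * a1 ** 2 - F(1, 4) * a2 ** 2 - F(1, 4) * a3 ** 2
            - F(1, 2) * a1 * a2 - a1 * a3 + 2 * a1)


alpha_star = (F(4, 5), F(0), F(0))
u1, u2 = u_of_alpha(*alpha_star)
primal_value = u1 ** 2 + u2 ** 2
dual_value = dual(*alpha_star)
print(u1, u2, primal_value, dual_value)  # 2/5 4/5 4/5 4/5

# Coarse grid check (multiples of 1/5 in [0, 2]): no alpha beats the dual optimum.
grid = [F(k, 5) for k in range(11)]
best_on_grid = max(dual(a1, a2, a3) for a1 in grid for a2 in grid for a3 in grid)
print(best_on_grid)  # 4/5
```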














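We summarize our discussion in the following theorem, which states the 
dual formulation of a QP-problem. In the next section, we will show how 
this dual formulation is applied to the SVM QP-problem. The theorem looks 
formidable, but don’t be intimidated. Its application to the SVM QP-problem 
is conceptually no different from our little example above. 

Theorem 8.7 (KKT). For a feasible convex QP-problem in primal form, 

$$\begin{aligned}
\underset{\mathbf{u}\in\mathbb{R}^L}{\text{minimize:}}\quad & \tfrac{1}{2}\mathbf{u}^{\mathrm{T}}Q\mathbf{u} + \mathbf{p}^{\mathrm{T}}\mathbf{u} \\
\text{subject to:}\quad & \mathbf{a}_m^{\mathrm{T}}\mathbf{u} \ge c_m \quad (m = 1,\cdots,M),
\end{aligned}$$

define the Lagrange function 

$$\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha}) = \tfrac{1}{2}\mathbf{u}^{\mathrm{T}}Q\mathbf{u} + \mathbf{p}^{\mathrm{T}}\mathbf{u} + \sum_{m=1}^{M}\alpha_m\,(c_m - \mathbf{a}_m^{\mathrm{T}}\mathbf{u}).$$

The solution $\mathbf{u}^*$ is optimal for the primal if and only if $(\mathbf{u}^*, \boldsymbol{\alpha}^*)$ is a solution 
to the dual optimization problem 

$$\max_{\boldsymbol{\alpha}\ge\mathbf{0}}\,\min_{\mathbf{u}}\,\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha}).$$

The optimal $(\mathbf{u}^*, \boldsymbol{\alpha}^*)$ satisfies the Karush-Kuhn-Tucker (KKT) conditions: 

(i) Primal and dual constraints: 

$$\mathbf{a}_m^{\mathrm{T}}\mathbf{u}^* \ge c_m \quad\text{and}\quad \alpha_m^* \ge 0 \quad (m = 1,\cdots,M).$$

(ii) Complementary slackness: 

$$\alpha_m^*\,(\mathbf{a}_m^{\mathrm{T}}\mathbf{u}^* - c_m) = 0.$$

(iii) Stationarity with respect to $\mathbf{u}$: 

$$\nabla_{\mathbf{u}}\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha})\Big|_{\mathbf{u}=\mathbf{u}^*,\,\boldsymbol{\alpha}=\boldsymbol{\alpha}^*} = \mathbf{0}.$$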





The three KKT conditions in Theorem 8.7 characterize the relationship 
between the optimal $\mathbf{u}^*$ and $\boldsymbol{\alpha}^*$ and can often be used to simplify the Lagrange 
dual problem. The constraints are inherited from the constraints in the 
definitions of the primal and dual optimization problems. Complementary slackness 
is a condition you derived in (8.12) which says that at the optimal solution, 
the constraints fall into two types: those which are on the boundary and are 
exactly satisfied (the active constraints) and those which are in the interior of 
the feasible set. The interior constraints must have Lagrange multipliers equal 
to zero. The stationarity with respect to $\mathbf{u}$ is the necessary and sufficient 
condition for a convex program to solve the first part of the dual problem, 
namely $\min_{\mathbf{u}}\mathcal{L}(\mathbf{u}, \boldsymbol{\alpha})$. Next, we focus on the hard-margin SVM, and use the 
KKT conditions, in particular, stationarity with respect to $\mathbf{u}$, to simplify the 
Lagrange dual problem. 
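
The three KKT conditions are easy to check numerically for a candidate pair $(\mathbf{u}, \boldsymbol{\alpha})$. The following sketch assumes NumPy and a symmetric $Q$; the function name `kkt_residuals` and the dictionary keys are made up for this illustration. It is exercised on Example 8.6, written in the form of the theorem with $Q = 2I$, $\mathbf{p} = \mathbf{0}$, and one row of `A` per constraint.

```python
import numpy as np


def kkt_residuals(Q, p, A, c, u, alpha):
    """Largest violation of each KKT condition (all zero at an optimal pair).

    Rows of A are the constraint vectors a_m; the constraints are A u >= c.
    Q is assumed symmetric, so the gradient of 0.5 u'Qu is Q u.
    """
    slack = A @ u - c  # a_m' u - c_m, should be >= 0
    return {
        "primal": max(0.0, float(np.max(-slack))),
        "dual": max(0.0, float(np.max(-alpha))),
        "complementary_slackness": float(np.max(np.abs(alpha * slack))),
        "stationarity": float(np.max(np.abs(Q @ u + p - A.T @ alpha))),
    }


# Example 8.6: minimize u1^2 + u2^2 s.t. u1 + 2 u2 >= 2, u1 >= 0, u2 >= 0.
Q = 2.0 * np.eye(2)
p = np.zeros(2)
A = np.array([[1.0, 2.0], [1.0, 0.0], [0.0, 1.0]])
c = np.array([2.0, 0.0, 0.0])

optimal = kkt_residuals(Q, p, A, c, np.array([0.4, 0.8]), np.array([0.8, 0.0, 0.0]))
not_optimal = kkt_residuals(Q, p, A, c, np.array([2.0, 0.0]), np.zeros(3))
print(optimal)
print(not_optimal)
```

The second point is feasible but not stationary, so it fails condition (iii) even though conditions (i) and (ii) hold.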


8.2.2 Dual of the Hard-Margin SVM 


We now apply the KKT theorem to the convex QP-problem for hard-margin 
SVM (8.9). The mechanics of our derivation are not more complex than in 
Example 8.6, though the algebra is more cumbersome. The steps we followed 
in Example 8.6 are a helpful guide to keep in mind. 

The hard-margin optimization problem only applies when the data is 
linearly separable. This means that the optimization problem is convex and 
feasible, so the KKT theorem applies. For your convenience, we reproduce the 
hard-margin SVM QP-problem here, 

$$\begin{aligned}
\underset{b,\,\mathbf{w}}{\text{minimize:}}\quad & \tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \\
\text{subject to:}\quad & y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1 \quad (n = 1,\cdots,N).
\end{aligned} \tag{8.16}$$

The optimization variable is $\mathbf{u} = \begin{bmatrix} b \\ \mathbf{w} \end{bmatrix}$. The first task is to construct the 
Lagrangian. There are $N$ constraints, so we add $N$ penalty terms, and introduce 
a Lagrange multiplier $\alpha_n$ for each penalty term. The Lagrangian is 

$$\begin{aligned}
\mathcal{L}(b, \mathbf{w}, \boldsymbol{\alpha}) &= \tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \sum_{n=1}^{N}\alpha_n\left(1 - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b)\right) \\
&= \tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \sum_{n=1}^{N}\alpha_ny_n\mathbf{w}^{\mathrm{T}}\mathbf{x}_n - b\sum_{n=1}^{N}\alpha_ny_n + \sum_{n=1}^{N}\alpha_n.
\end{aligned} \tag{8.17}$$


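We must first minimize $\mathcal{L}$ with respect to $(b, \mathbf{w})$ and then maximize with 
respect to $\boldsymbol{\alpha} \ge \mathbf{0}$. To minimize with respect to $(b, \mathbf{w})$, we need the derivatives 
of $\mathcal{L}$. The derivative with respect to $b$ is just the coefficient of $b$ because 
$b$ appears linearly in the Lagrangian. To differentiate the terms involving 
$\mathbf{w}$, we need some vector calculus identities from the linear algebra appendix: 
$\frac{\partial}{\partial\mathbf{w}}\mathbf{w}^{\mathrm{T}}\mathbf{w} = 2\mathbf{w}$ and $\frac{\partial}{\partial\mathbf{w}}\mathbf{w}^{\mathrm{T}}\mathbf{x} = \mathbf{x}$. The reader may now verify that 

$$\frac{\partial\mathcal{L}}{\partial b} = -\sum_{n=1}^{N}\alpha_ny_n \quad\text{and}\quad \frac{\partial\mathcal{L}}{\partial\mathbf{w}} = \mathbf{w} - \sum_{n=1}^{N}\alpha_ny_n\mathbf{x}_n.$$

Setting these derivatives to zero gives 

$$\sum_{n=1}^{N}\alpha_ny_n = 0; \tag{8.18}$$

$$\mathbf{w} = \sum_{n=1}^{N}\alpha_ny_n\mathbf{x}_n. \tag{8.19}$$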


Something strange has happened (highlighted in red), which did not happen in 
Example 8.6. After setting the derivatives to zero in Example 8.6, we were able 
to solve for $\mathbf{u}$ in terms of the Lagrange multipliers $\boldsymbol{\alpha}$. Here, the stationarity 
condition for $b$ does not allow us to solve for $b$ directly ☹. Instead, we got a 
constraint that $\boldsymbol{\alpha}$ must satisfy at the final solution. Don’t let this unsettle you. 
The KKT theorem will allow us to ultimately recover $b$ ☺. This constraint 
on $\boldsymbol{\alpha}$ is not surprising. Any choice for $\boldsymbol{\alpha}$ which does not satisfy (8.18) would 
allow $\mathcal{L} \to -\infty$ by appropriately choosing $b$. Since we maximize over $\boldsymbol{\alpha}$, we 
must choose $\boldsymbol{\alpha}$ to satisfy the constraint and prevent $\mathcal{L} \to -\infty$. 

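We proceed as in Example 8.6 by plugging the stationarity constraints back 
into the Lagrangian to get a function solely in terms of $\boldsymbol{\alpha}$. Observe that the 
constraint in (8.18) means that the term involving $b$ in the Lagrangian (8.17) 
reduces to zero. The terms in the Lagrangian (8.17) involving the weights $\mathbf{w}$ 
simplify when we use the expression in (8.19): 

$$\begin{aligned}
\tfrac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \sum_{n=1}^{N}\alpha_ny_n\mathbf{w}^{\mathrm{T}}\mathbf{x}_n
&= \tfrac{1}{2}\underbrace{\sum_{n=1}^{N}\alpha_ny_n\mathbf{x}_n^{\mathrm{T}}}_{\mathbf{w}^{\mathrm{T}}}\,\underbrace{\sum_{m=1}^{N}\alpha_my_m\mathbf{x}_m}_{\mathbf{w}} - \sum_{n=1}^{N}\alpha_ny_n\underbrace{\sum_{m=1}^{N}\alpha_my_m\mathbf{x}_m^{\mathrm{T}}}_{\mathbf{w}^{\mathrm{T}}}\mathbf{x}_n \\
&= \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_ny_m\alpha_n\alpha_m\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m - \sum_{n=1}^{N}\sum_{m=1}^{N}y_ny_m\alpha_n\alpha_m\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m \\
&= -\tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_ny_m\alpha_n\alpha_m\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m.
\end{aligned} \tag{8.20}$$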
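
The identity (8.20) can be sanity-checked numerically. The sketch below assumes NumPy; the data, labels and multipliers are random placeholders generated only for this check, and the variable names are illustrative. It forms $\mathbf{w}$ from (8.19) and compares the two sides of (8.20).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 3
X = rng.normal(size=(N, d))            # placeholder inputs, one row per x_n
y = rng.choice([-1.0, 1.0], size=N)    # placeholder labels
alpha = rng.uniform(size=N)            # placeholder multipliers

w = (alpha * y) @ X                    # (8.19): w = sum_n alpha_n y_n x_n

# Left side of (8.20): 0.5 w'w - sum_n alpha_n y_n w'x_n
lhs = 0.5 * w @ w - np.sum(alpha * y * (X @ w))

# Right side of (8.20): -0.5 sum_n sum_m y_n y_m alpha_n alpha_m x_n'x_m
signed_gram = np.outer(y, y) * (X @ X.T)
rhs = -0.5 * alpha @ signed_gram @ alpha

print(np.isclose(lhs, rhs))
```

Note that the check does not require (8.18) to hold, since (8.20) only involves the terms of the Lagrangian that contain $\mathbf{w}$.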


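After we use (8.18) and (8.20) in (8.17), the Lagrangian reduces to a simpler 
function of just the variable $\boldsymbol{\alpha}$: 

$$\mathcal{L}(\boldsymbol{\alpha}) = -\tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}y_ny_m\alpha_n\alpha_m\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m + \sum_{n=1}^{N}\alpha_n.$$

We must now maximize $\mathcal{L}(\boldsymbol{\alpha})$ subject to $\boldsymbol{\alpha} \ge \mathbf{0}$, and we cannot forget the 
constraint (8.18) which we inherited from the stationarity condition on $b$. We 
can equivalently minimize the negative of $\mathcal{L}(\boldsymbol{\alpha})$, and so we have the following 
optimization problem to solve. 

$$\begin{aligned}
\underset{\boldsymbol{\alpha}\in\mathbb{R}^N}{\text{minimize:}}\quad & \tfrac{1}{2}\sum_{m=1}^{N}\sum_{n=1}^{N}y_ny_m\alpha_n\alpha_m\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m - \sum_{n=1}^{N}\alpha_n \\
\text{subject to:}\quad & \sum_{n=1}^{N}y_n\alpha_n = 0 \\
& \alpha_n \ge 0 \quad (n = 1,\cdots,N).
\end{aligned} \tag{8.21}$$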


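If you made it to here, you deserve a pat on the back for surviving all the 
algebra ☺. The result is similar to what happened in Example 8.6. We have 
not solved the problem; we have reduced it to another QP-problem. The next 
exercise guides you through the steps to put (8.21) into the standard QP form. 

Exercise 8.11 

(a) Show that the problem in (8.21) is a standard QP-problem: 

$$\begin{aligned}
\underset{\boldsymbol{\alpha}\in\mathbb{R}^N}{\text{minimize:}}\quad & \tfrac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}}Q_D\boldsymbol{\alpha} - \mathbf{1}_N^{\mathrm{T}}\boldsymbol{\alpha} \\
\text{subject to:}\quad & A_D\boldsymbol{\alpha} \ge \mathbf{0}_{N+2},
\end{aligned} \tag{8.22}$$

where $Q_D$ and $A_D$ (D for dual) are given by: 

$$Q_D = \begin{bmatrix}
y_1y_1\mathbf{x}_1^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_1y_N\mathbf{x}_1^{\mathrm{T}}\mathbf{x}_N \\
y_2y_1\mathbf{x}_2^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_2y_N\mathbf{x}_2^{\mathrm{T}}\mathbf{x}_N \\
\vdots & \ddots & \vdots \\
y_Ny_1\mathbf{x}_N^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_Ny_N\mathbf{x}_N^{\mathrm{T}}\mathbf{x}_N
\end{bmatrix}
\quad\text{and}\quad
A_D = \begin{bmatrix} \mathbf{y}^{\mathrm{T}} \\ -\mathbf{y}^{\mathrm{T}} \\ I_{N\times N} \end{bmatrix}.$$

[Hint: Recall that an equality corresponds to two inequalities.] 

(b) The matrix $Q_D$ of quadratic coefficients is $[Q_D]_{mn} = y_my_n\mathbf{x}_m^{\mathrm{T}}\mathbf{x}_n$. 
Show that $Q_D = X_sX_s^{\mathrm{T}}$, where $X_s$ is the ‘signed data matrix’, 

$$X_s = \begin{bmatrix} y_1\mathbf{x}_1^{\mathrm{T}} \\ y_2\mathbf{x}_2^{\mathrm{T}} \\ \vdots \\ y_N\mathbf{x}_N^{\mathrm{T}} \end{bmatrix}.$$

Hence, show that $Q_D$ is positive semi-definite. This implies that the 
QP-problem is convex. 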
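
The construction in the exercise translates directly into a few lines of NumPy. The sketch below is a numerical illustration rather than a proof; the function name `build_dual_qp` is hypothetical, and the data is the four-point toy problem used in the example that follows. It builds $Q_D$ through the signed data matrix and inspects the smallest eigenvalue, which should be non-negative up to rounding.

```python
import numpy as np


def build_dual_qp(X, y):
    """Return (Q_D, p, A_D, c) for the dual QP (8.22).

    X has one input per row; y holds the +/-1 labels.
    """
    N = X.shape[0]
    X_s = y[:, None] * X                 # signed data matrix, rows y_n x_n'
    Q_D = X_s @ X_s.T                    # [Q_D]_mn = y_m y_n x_m' x_n
    p = -np.ones(N)                      # linear term -1_N
    A_D = np.vstack([y, -y, np.eye(N)])  # y'a >= 0, -y'a >= 0, a >= 0
    c = np.zeros(N + 2)
    return Q_D, p, A_D, c


X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
Q_D, p, A_D, c = build_dual_qp(X, y)

min_eigenvalue = np.linalg.eigvalsh(Q_D).min()
print(Q_D)
print(A_D.shape, min_eigenvalue >= -1e-9)
```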


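Example 8.8. Let us illustrate all the machinery of the dual formulation using 
the data in Example 8.2 (the toy problem from Section 8.1). For your 
convenience, we reproduce the data, and we compute $Q_D$, $A_D$ using Exercise 8.11: 

$$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix},\quad
\mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix};\qquad
Q_D = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 8 & -4 & -6 \\ 0 & -4 & 4 & 6 \\ 0 & -6 & 6 & 9 \end{bmatrix},\quad
A_D = \begin{bmatrix} -1 & -1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

We can now write down the optimization problem in (8.21) and manually solve 
it to get the optimal $\boldsymbol{\alpha}$. The masochist reader is invited to do so in Problem 8.3. 
Instead, we can use a QP-solver to solve (8.22) in under a millisecond: 

$$\boldsymbol{\alpha}^* \leftarrow \mathrm{QP}(Q_D, -\mathbf{1}, A_D, \mathbf{0}) \quad\text{gives}\quad \boldsymbol{\alpha}^* = \begin{bmatrix} \tfrac{1}{2} \\ \tfrac{1}{2} \\ 1 \\ 0 \end{bmatrix}.$$

Now that we have $\boldsymbol{\alpha}^*$, we can compute the optimal weights using (8.19), 

$$\mathbf{w}^* = \sum_{n=1}^{4}y_n\alpha_n^*\mathbf{x}_n = -\tfrac{1}{2}\begin{bmatrix} 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 2 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}.$$

As expected, these are the same optimal weights we got in Example 8.2. What 
about $b$? This is where things get tricky. Recall the complementary slackness 
KKT condition in Theorem 8.7. It asserts that if $\alpha_n^* > 0$, then the 
corresponding constraint must be exactly satisfied. In our example, $\alpha_1^* = \tfrac{1}{2} > 0$, so 
$y_1(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_1 + b^*) = 1$. Since $\mathbf{x}_1 = \mathbf{0}$ and $y_1 = -1$, this means $b^* = -1$, exactly 
as in Example 8.2. So, $g(\mathbf{x}) = \operatorname{sign}(x_1 - x_2 - 1)$. 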

















On a practical note, the $N \times N$ matrix $Q_D$ is often dense (containing lots 
of non-zero elements). If $N = 100{,}000$, storing $Q_D$ uses more than 10GB of 
RAM. Thus, for large-scale applications, specially tailored QP packages that 
dynamically compute $Q_D$ and use specific properties of SVM are often used to 
solve (8.22) efficiently. 
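
A back-of-the-envelope calculation shows where the memory figure comes from. The sketch below is only arithmetic; the function name and the per-entry sizes are assumptions (8 bytes for double precision, 4 bytes for single precision), not something specified in the text.

```python
def dense_matrix_gigabytes(N, bytes_per_entry=8):
    """Memory for a dense N x N matrix, in units of 10^9 bytes."""
    return N * N * bytes_per_entry / 1e9


print(dense_matrix_gigabytes(100_000))                     # double precision: 80.0
print(dense_matrix_gigabytes(100_000, bytes_per_entry=4))  # single precision: 40.0
```

Even at one byte per entry the $10^{10}$ entries already occupy 10GB, so the estimate in the text holds for any realistic number format.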


8.2.3 Recovering the SVM from the Dual Solution 


Now that we have solved the dual problem using quadratic programming to 
obtain the optimal solution $\boldsymbol{\alpha}^*$, what remains is to compute the optimal 
hyperplane $(b^*, \mathbf{w}^*)$. The weights are easy, and are obtained using the stationarity 
condition in (8.19): 

$$\mathbf{w}^* = \sum_{n=1}^{N}y_n\alpha_n^*\mathbf{x}_n. \tag{8.23}$$

Assume that the data contains at least one positive and one negative example 
(otherwise the classification problem is trivial). Then, at least one of the 
$\{\alpha_n^*\}$ will be strictly positive. Let $\alpha_s^* > 0$ ($s$ for support vector). The KKT 
complementary slackness condition in Theorem 8.7 tells us that the constraint 
corresponding to this non-zero $\alpha_s^*$ is equality. This means that 

$$y_s\,(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_s + b^*) = 1.$$

We can solve for $b^*$ in the above equation to get 

$$\begin{aligned}
b^* &= y_s - \mathbf{w}^{*\mathrm{T}}\mathbf{x}_s \\
&= y_s - \sum_{n=1}^{N}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_s.
\end{aligned} \tag{8.24}$$
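
Equations (8.23) and (8.24) can be tried out on the toy problem of Example 8.8, taking the dual solution $\boldsymbol{\alpha}^* = (\tfrac{1}{2}, \tfrac{1}{2}, 1, 0)$ as given. The sketch below assumes NumPy; the variable names and the threshold used to decide which multipliers count as positive are illustrative choices. It also shows that every support vector yields the same $b^*$.

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = np.array([0.5, 0.5, 1.0, 0.0])   # dual solution from Example 8.8

# (8.23): w* = sum_n y_n alpha_n* x_n
w = (alpha * y) @ X

# (8.24): b* = y_s - w*'x_s for any s with alpha_s* > 0
support = np.flatnonzero(alpha > 1e-8)   # threshold is an arbitrary small number
b_values = [float(y[s] - w @ X[s]) for s in support]

print(w)         # [ 1. -1.]
print(b_values)  # [-1.0, -1.0, -1.0]
```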
Exercise 8.12 
If all the data is from one class, then $\alpha_n^* = 0$ for $n = 1,\ldots,N$. 

(a) What is $\mathbf{w}^*$? 
(b) What is $b^*$? 

Equations (8.23) and (8.24) connect the optimal $\boldsymbol{\alpha}^*$, which we get from solving 
the dual problem, to the optimal hyperplane $(b^*, \mathbf{w}^*)$ which solves (8.4). Most 
importantly, the optimal hypothesis is 

$$\begin{aligned}
g(\mathbf{x}) &= \operatorname{sign}(\mathbf{w}^{*\mathrm{T}}\mathbf{x} + b^*) \\
&= \operatorname{sign}\left(\sum_{n=1}^{N}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}\mathbf{x} + b^*\right) \\
&= \operatorname{sign}\left(\sum_{n=1}^{N}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}(\mathbf{x} - \mathbf{x}_s) + y_s\right).
\end{aligned} \tag{8.25}$$

Recall that $(\mathbf{x}_s, y_s)$ is any support vector which is defined by the condition 
$\alpha_s^* > 0$. There is a summation over $n$ in Equations (8.23), (8.24) and (8.25), 
but only the terms with $\alpha_n^* > 0$ contribute to the summations. We can 
therefore get a more efficient representation for $g(\mathbf{x})$ using only the positive $\alpha_n^*$: 

$$g(\mathbf{x}) = \operatorname{sign}\left(\sum_{\alpha_n^*>0}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}\mathbf{x} + b^*\right), \tag{8.26}$$

where $b^*$ is given by (8.24) (the summation (8.24) can also be restricted to 
only those $\alpha_n^* > 0$). So, $g(\mathbf{x})$ is determined by only those examples $(\mathbf{x}_n, y_n)$ 
for which $\alpha_n^* > 0$. We summarize our long discussion about the dual in the 
following algorithm box. 


Hard-Margin SVM with Dual QP 
1: Construct $Q_D$ and $A_D$ as in Exercise 8.11 

$$Q_D = \begin{bmatrix}
y_1y_1\mathbf{x}_1^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_1y_N\mathbf{x}_1^{\mathrm{T}}\mathbf{x}_N \\
y_2y_1\mathbf{x}_2^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_2y_N\mathbf{x}_2^{\mathrm{T}}\mathbf{x}_N \\
\vdots & \ddots & \vdots \\
y_Ny_1\mathbf{x}_N^{\mathrm{T}}\mathbf{x}_1 & \cdots & y_Ny_N\mathbf{x}_N^{\mathrm{T}}\mathbf{x}_N
\end{bmatrix}
\quad\text{and}\quad
A_D = \begin{bmatrix} \mathbf{y}^{\mathrm{T}} \\ -\mathbf{y}^{\mathrm{T}} \\ I_{N\times N} \end{bmatrix}.$$

2: Use a QP-solver to optimize the dual problem: 

$$\boldsymbol{\alpha}^* \leftarrow \mathrm{QP}(Q_D, -\mathbf{1}_N, A_D, \mathbf{0}_{N+2}).$$

3: Let $s$ be a support vector for which $\alpha_s^* > 0$. Compute $b^*$, 

$$b^* = y_s - \sum_{\alpha_n^*>0}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_s.$$

4: Return the final hypothesis 

$$g(\mathbf{x}) = \operatorname{sign}\left(\sum_{\alpha_n^*>0}y_n\alpha_n^*\mathbf{x}_n^{\mathrm{T}}\mathbf{x} + b^*\right).$$

The support vectors are the examples $(\mathbf{x}_s, y_s)$ for which $\alpha_s^* > 0$. The 
support vectors play two important roles. First, they serve to identify the data 
points on the boundary of the optimal fat-hyperplane. This is because of the 
complementary slackness condition in the KKT theorem: 

$$y_s\,(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_s + b^*) = 1.$$

This condition identifies the support vectors as being closest to, in fact on, 
the boundary of the optimal fat-hyperplane. This leads us to an interesting 
geometric interpretation: the dual SVM identifies the support vectors on the 
boundary of the optimal fat-hyperplane, and uses only those support vectors 
to construct the final classifier. We already highlighted these support vectors 
in Section 8.1, where we used the term support vector to highlight the fact that 
these data points are ‘supporting’ the cushion, preventing it from expanding 
further. 
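
The four steps of the algorithm box above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a production solver: it assumes NumPy and SciPy are available, and it uses SciPy's general-purpose SLSQP routine as a stand-in for the QP-solver, passing the equality constraint and the bounds $\alpha_n \ge 0$ directly instead of forming $A_D$. The function name, the tolerance `tol`, and the solver options are choices made for this example. The data is assumed to be linearly separable and to contain both classes.

```python
import numpy as np
from scipy.optimize import minimize


def hard_margin_svm_dual(X, y, tol=1e-5):
    """Hard-margin SVM via the dual (8.21); returns (alpha, w, b)."""
    N = X.shape[0]

    # Step 1: quadratic coefficients Q_D = X_s X_s'.
    X_s = y[:, None] * X
    Q_D = X_s @ X_s.T

    # Step 2: minimize 0.5 a'Q_D a - 1'a  s.t.  y'a = 0, a >= 0.
    result = minimize(
        fun=lambda a: 0.5 * a @ Q_D @ a - a.sum(),
        x0=np.zeros(N),
        jac=lambda a: Q_D @ a - np.ones(N),
        method="SLSQP",
        bounds=[(0.0, None)] * N,
        constraints=[{"type": "eq", "fun": lambda a: y @ a, "jac": lambda a: y}],
        options={"ftol": 1e-9, "maxiter": 500},
    )
    alpha = result.x

    # Step 3: weights from the support vectors, b from the largest alpha.
    sv = alpha > tol
    w = (alpha[sv] * y[sv]) @ X[sv]
    s = int(np.argmax(alpha))
    b = float(y[s] - w @ X[s])
    return alpha, w, b


# Toy data from Example 8.8.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, w, b = hard_margin_svm_dual(X, y)

# Step 4: final hypothesis g(x) = sign(w'x + b), evaluated on the training data.
predictions = np.sign(X @ w + b)
print(np.round(alpha, 3), np.round(w, 3), round(b, 3))
print(predictions)
```

For large $N$ a dedicated SVM package would be the usual choice, for the memory reasons discussed earlier; a general-purpose solver is used here only because the toy problem has four variables.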


Exercise 8.13 

KKT complementary slackness gives that if $\alpha_n^* > 0$, then $(\mathbf{x}_n, y_n)$ is on 
the boundary of the optimal fat-hyperplane and $y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) = 1$. 
Show that the reverse is not true. Namely, it is possible that $\alpha_n^* = 0$ and 
yet $(\mathbf{x}_n, y_n)$ is on the boundary satisfying $y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) = 1$. 

[Hint: Consider a toy data set with two positive examples at $(0,0)$ and 
$(1,0)$, and one negative example at $(0,1)$.] 


The previous exercise says that from the dual, we identify a subset of the 
points on the boundary of the optimal fat-hyperplane which are called support 
vectors. All points on the boundary are support vector candidates, but only 
those with $\alpha_n^* > 0$ (which contribute to $\mathbf{w}^*$) are the support vectors. 

The second role played by the support vectors is to determine the final 
hypothesis g(x) through (8.26). The SVM dual problem directly identifies the 
data points relevant to the final hypothesis. Observe that only these support 
vectors are needed to compute g(x). 


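Example 8.9. In Example 8.8 we computed $\boldsymbol{\alpha}^* = (\tfrac{1}{2}, \tfrac{1}{2}, 1, 0)$ for our toy 
problem. Since $\alpha_1^*$ is positive, we can choose $(\mathbf{x}_1, y_1) = (\mathbf{0}, -1)$ as our support 
vector to compute $b^*$ in (8.24). The final hypothesis is 

$$\begin{aligned}
g(\mathbf{x}) &= \operatorname{sign}\left(-\tfrac{1}{2}\cdot\begin{bmatrix} 2 & 2 \end{bmatrix}\cdot\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 2 & 0 \end{bmatrix}\cdot\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - 1\right) \\
&= \operatorname{sign}(x_1 - x_2 - 1),
\end{aligned}$$

computed now for the third time ☺. 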





We used the fact that only the support vectors are needed to compute the 
final hypothesis to derive the upper bound on $E_{\mathrm{cv}}$ given in (8.8); this upper 
bound depends only on the number of support vectors. Actually, our bound 
on $E_{\mathrm{cv}}$ is based on the number of support vector candidates. Since not all 
support vector candidates are support vectors, one usually gets a much better 
bound on $E_{\mathrm{cv}}$ by instead using the number of support vectors. That is, 

$$E_{\mathrm{cv}} \le \frac{\text{number of } \alpha_n^* > 0}{N}. \tag{8.27}$$

To prove this bound, it is necessary to show that if one throws out any data 
point with $\alpha_n^* = 0$, the final hypothesis is not changed. The next exercise asks 
you to do exactly this. 


Exercise 8.14
Suppose that we removed a data point $(\mathbf{x}_n, y_n)$ with $\alpha_n^* = 0$.

(a) Show that the previous optimal solution $\boldsymbol{\alpha}^*$ remains feasible for the
new dual problem (8.21) (after removing $\alpha_n^*$).

(b) Show that if there is any other feasible solution for the new dual
that has a lower objective value than $\boldsymbol{\alpha}^*$, this would contradict the
optimality of $\boldsymbol{\alpha}^*$ for the original dual problem.

(c) Hence, show that $\boldsymbol{\alpha}^*$ (minus $\alpha_n^*$) is optimal for the new dual.
(d) Hence, show that the optimal fat-hyperplane did not change.
(e) Prove the bound on $E_{\mathrm{cv}}$ in (8.27).


In practice, there typically aren’t many support vectors, so the bound in (8.27)
can be quite good. Figure 8.8 shows a data set with 50 random data points
and the resulting optimal hyperplane with 3 support vectors (in black boxes).
The support vectors are identified from the dual solution ($\alpha_n^* > 0$). Figure 8.8
supports the fact that there are often just a few support vectors. This means
that the optimal $\boldsymbol{\alpha}^*$ is usually sparse, containing many zero elements and a
few non-zeros. The sparsity property means that the representation of $g(\mathbf{x})$
in (8.26) can be computed efficiently using only these few support vectors. If
there are many support vectors, it is usually more efficient to compute $(b^*, \mathbf{w}^*)$
ahead of time, and use $g(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^{*\mathrm{T}}\mathbf{x} + b^*)$ for prediction.

Figure 8.8: Support vectors identified from the SVM dual (3 black boxes).


8.3 Kernel Trick for SVM 


We advertised the kernel as a way to use nonlinear transforms into high di- 
mensional spaces efficiently. We are now going to deliver on that promise for 
SVM. In order to couple the kernel with SVM, we need to view SVM from 
the dual formulation. And that is why we expended considerable effort to 
understand this alternative dual view of SVM. The kernel, together with the 
dual formulation, will allow us to efficiently run SVM with transforms to high 
or even infinite dimensional spaces. 


8.3.1 Kernel Trick via Dual SVM 


Let’s start by revisiting the procedure for solving nonlinear SVM from the
dual formulation based on a nonlinear transform $\Phi\colon \mathcal{X} \to \mathcal{Z}$, which can be
done by replacing $\mathbf{x}$ by $\mathbf{z} = \Phi(\mathbf{x})$ in the algorithm box on page 8-30. First,
calculate the coefficients for the dual problem that includes the $Q_D$ matrix;
then solve the dual problem to identify the non-zero Lagrange multipliers $\alpha_n^*$
and the corresponding support vectors $(\mathbf{x}_n, y_n)$; finally, use one of the support
vectors to calculate $b^*$, and return the hypothesis $g(\mathbf{x})$ based on $b^*$, the support
vectors, and their Lagrange multipliers.

Throughout the procedure, the only step that may depend on $\tilde{d}$, which is
the dimension of $\Phi(\mathbf{x})$, is in calculating the $\mathcal{Z}$-space inner product

\[
\Phi(\mathbf{x})^{\mathrm{T}} \Phi(\mathbf{x}').
\]


This inner product is needed in the formulation of $Q_D$ and in the expression
for $g(\mathbf{x})$. The ‘kernel trick’ is based on the following idea: If the transform
and the inner product can be jointly and efficiently computed in a way that
is independent of $\tilde{d}$, the whole nonlinear SVM procedure can be carried out
without computing/storing each $\Phi(\mathbf{x})$ explicitly. Then, the procedure can
work efficiently for a large or even an infinite $\tilde{d}$.

So the question is, can we do the transform and compute the inner product
efficiently, irrespective of $\tilde{d}$? Let us first define a function
that combines both the transform and the inner product:

\[
K_\Phi(\mathbf{x}, \mathbf{x}') = \Phi(\mathbf{x})^{\mathrm{T}} \Phi(\mathbf{x}'). \qquad (8.28)
\]

This function is called a kernel function, or just kernel. The kernel takes as
input two vectors in the $\mathcal{X}$ space and outputs the inner product that would
be computed in the $\mathcal{Z}$ space (for the transform $\Phi$). In its explicit form (8.28),
it appears that the kernel transforms the inputs $\mathbf{x}$ and $\mathbf{x}'$ to the $\mathcal{Z}$ space
and then computes the inner product. The efficiency of this process would
certainly depend on the dimension of the $\mathcal{Z}$ space, which is $\tilde{d}$. The question
is whether we can compute $K_\Phi(\mathbf{x}, \mathbf{x}')$ more efficiently.

For two cases of specific nonlinear transforms $\Phi$, we are going to demonstrate
that their corresponding kernel functions $K_\Phi$ can indeed be efficiently
computed, with a cost proportional to $d$ instead of $\tilde{d}$. (For simplicity, we will
use $K$ instead of $K_\Phi$ when $\Phi$ is clear from the context.)


Polynomial Kernel. Consider the second-order polynomial transform:

\[
\Phi_2(\mathbf{x}) = (1, x_1, x_2, \ldots, x_d, x_1x_1, x_1x_2, \ldots, x_dx_d),
\]

where $\tilde{d} = 1 + d + d^2$ (the identical features $x_ix_j$ and $x_jx_i$ are included separately
for mathematical convenience as you will see below). A direct computation
of $\Phi_2(\mathbf{x})$ takes $O(\tilde{d})$ in time, and thus a direct computation of

\[
\Phi_2(\mathbf{x})^{\mathrm{T}} \Phi_2(\mathbf{x}') = 1 + \sum_{i=1}^{d} x_i x_i' + \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j'
\]

also takes time $O(\tilde{d})$. We can simplify the double summation by reorganizing
the terms into a product of two separate summations:

\[
\sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j' = \sum_{i=1}^{d} x_i x_i' \times \sum_{j=1}^{d} x_j x_j' = (\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2.
\]

Therefore, we can calculate $\Phi_2(\mathbf{x})^{\mathrm{T}} \Phi_2(\mathbf{x}')$ by an equivalent function

\[
K(\mathbf{x}, \mathbf{x}') = 1 + (\mathbf{x}^{\mathrm{T}}\mathbf{x}') + (\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2.
\]

In this instance, we see that $K$ can be easily computed in time $O(d)$, which is
asymptotically faster than $O(\tilde{d})$. So, we have demonstrated that, for the polynomial
transform, the kernel $K$ is a mathematical and computational shortcut
that allows us to combine the transform and the inner product into a single
more efficient function.
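As a quick numerical check of this shortcut, the sketch below builds the explicit second-order transform (with all $d^2$ product features, as in the text) and compares its inner product with the kernel formula. The function names `phi2` and `poly2_kernel` and the two test vectors are illustrative choices, not part of the text.

```python
def phi2(x):
    """Explicit second-order transform: (1, x_1..x_d, all d*d products x_i*x_j)."""
    return [1.0] + list(x) + [xi * xj for xi in x for xj in x]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def poly2_kernel(x, xp):
    """K(x, x') = 1 + x.x' + (x.x')^2, computed with O(d) work."""
    s = dot(x, xp)
    return 1.0 + s + s * s


x = [1.0, 2.0, 3.0]
xp = [0.5, -1.0, 2.0]

explicit = dot(phi2(x), phi2(xp))   # works in the (1 + d + d^2)-dimensional Z space
shortcut = poly2_kernel(x, xp)      # never leaves the d-dimensional X space
print(len(phi2(x)), explicit, shortcut)
```

Both numbers agree, while the explicit route touches $1 + d + d^2 = 13$ features for $d = 3$ and the kernel route touches only 3.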

If the kernel $K$ is efficiently computable for some specific $\Phi$, as is the case
for our polynomial transform, then whenever we need to compute the inner
product of transformed inputs in the $\mathcal{Z}$ space, we can use the kernel trick
and instead compute the kernel function of those inputs in the $\mathcal{X}$ space. Any
learning technique that uses inner products can benefit from this kernel trick.

In particular, let us look back at the SVM dual problem transformed into
the $\mathcal{Z}$ space. We obtain the dual problem in the $\mathcal{Z}$ space by replacing every
instance of $\mathbf{x}_n$ in (8.21) with $\mathbf{z}_n = \Phi(\mathbf{x}_n)$. After we do this, the dual
optimization problem becomes:

\[
\begin{aligned}
\underset{\boldsymbol{\alpha}}{\text{minimize:}} \quad & \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m \mathbf{z}_n^{\mathrm{T}} \mathbf{z}_m - \sum_{n=1}^{N} \alpha_n \qquad (8.29) \\
\text{subject to:} \quad & \sum_{n=1}^{N} y_n \alpha_n = 0 \\
& \alpha_n \ge 0 \quad (n = 1, \ldots, N)
\end{aligned}
\]

Now, recall the steps to obtain our final hypothesis. We solve the dual to
obtain the optimal $\boldsymbol{\alpha}^*$. For a support vector $(\mathbf{z}_s, y_s)$ with $\alpha_s^* > 0$, define
$b^* = y_s - \sum_{\alpha_n^* > 0} y_n \alpha_n^* \mathbf{z}_n^{\mathrm{T}} \mathbf{z}_s$. Then, the final hypothesis from (8.25) is

\[
g(\mathbf{x}) = \mathrm{sign}\left(\sum_{n=1}^{N} y_n \alpha_n^* \mathbf{z}_n^{\mathrm{T}} \mathbf{z} + b^*\right),
\]

where $\mathbf{z} = \Phi(\mathbf{x})$. In the entire dual formulation, $\mathbf{z}_n$ and $\mathbf{z}$ only appear as inner
products. If we use the kernel trick to replace every inner product with the
kernel function, then the entire process, from solving the dual to getting the final
hypothesis, will be in terms of the kernel. We need never visit the $\mathcal{Z}$ space to
explicitly construct $\mathbf{z}_n$ or $\mathbf{z}$. The algorithm box below summarizes these steps.


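Hard-Margin SVM with Kernel
1: Construct $Q_D$ from the kernel $K$, and $A_D$:

\[
Q_D = \begin{bmatrix}
y_1y_1K_{11} & \cdots & y_1y_NK_{1N} \\
y_2y_1K_{21} & \cdots & y_2y_NK_{2N} \\
\vdots & & \vdots \\
y_Ny_1K_{N1} & \cdots & y_Ny_NK_{NN}
\end{bmatrix}
\quad \text{and} \quad
A_D = \begin{bmatrix} \mathbf{y}^{\mathrm{T}} \\ -\mathbf{y}^{\mathrm{T}} \\ I_{N \times N} \end{bmatrix},
\]

   where $K_{mn} = K(\mathbf{x}_m, \mathbf{x}_n)$. ($\mathrm{K}$ is called the Gram matrix.)
2: Use a QP-solver to optimize the dual problem:

\[
\boldsymbol{\alpha}^* \leftarrow \mathrm{QP}(Q_D, -\mathbf{1}_N, A_D, \mathbf{0}_{N+2}).
\]

3: Let $s$ be any support vector for which $\alpha_s^* > 0$. Compute

\[
b^* = y_s - \sum_{\alpha_n^* > 0} y_n \alpha_n^* K(\mathbf{x}_n, \mathbf{x}_s).
\]

4: Return the final hypothesis

\[
g(\mathbf{x}) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* K(\mathbf{x}_n, \mathbf{x}) + b^*\right).
\]

In the algorithm, the dimension of the $\mathcal{Z}$ space has disappeared, and the
running time depends on the efficiency of the kernel, and not on $\tilde{d}$. For our
polynomial kernel this means the efficiency is determined by $d$.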
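The steps of the algorithm box can be sketched in a few lines. This is a minimal illustration, not a full implementation: step 2 (the QP-solver) is not run, and the vector `alpha` is taken as given. The data and `alpha` are the values assumed for the toy problem of Examples 8.8 and 8.9, and the linear kernel is used so the results can be checked by hand. The helper names `build_QD`, `bias` and `hypothesis` are hypothetical.

```python
def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))


def build_QD(X, y, kernel):
    """Step 1: Q_D[n][m] = y_n * y_m * K(x_n, x_m)."""
    N = len(X)
    return [[y[n] * y[m] * kernel(X[n], X[m]) for m in range(N)] for n in range(N)]


def bias(X, y, alpha, kernel):
    """Step 3: b* = y_s - sum_{alpha_n > 0} y_n alpha_n K(x_n, x_s)."""
    s = next(n for n, a in enumerate(alpha) if a > 0)
    return y[s] - sum(y[n] * alpha[n] * kernel(X[n], X[s])
                      for n in range(len(X)) if alpha[n] > 0)


def hypothesis(X, y, alpha, b, kernel):
    """Step 4: g(x) = sign(sum_{alpha_n > 0} y_n alpha_n K(x_n, x) + b*)."""
    def g(x):
        signal = sum(y[n] * alpha[n] * kernel(X[n], x)
                     for n in range(len(X)) if alpha[n] > 0) + b
        return 1 if signal > 0 else -1
    return g


X = [(0.0, 0.0), (2.0, 2.0), (2.0, 0.0), (3.0, 0.0)]
y = [-1, -1, 1, 1]
alpha = [0.5, 0.5, 1.0, 0.0]   # stand-in for the output of the QP-solver in step 2

QD = build_QD(X, y, linear_kernel)
b = bias(X, y, alpha, linear_kernel)
g = hypothesis(X, y, alpha, b, linear_kernel)
dual_objective = 0.5 * sum(alpha[n] * QD[n][m] * alpha[m]
                           for n in range(4) for m in range(4)) - sum(alpha)
print(QD[1], b, dual_objective, [g(x) for x in X])
```

Swapping `linear_kernel` for any other kernel function changes nothing else in the code, which is the practical content of the kernel trick: the data enter only through `kernel(...)` calls.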

An efficient kernel function relies on a carefully constructed specific transform
to allow fast computation. If we consider another 2nd-order transform

\[
\Phi(\mathbf{x}) = (3, 1 \cdot x_1, 4 \cdot x_2, \ldots, 1 \cdot x_d, 5 \cdot x_1x_1, 9 \cdot x_1x_2, \ldots, 2 \cdot x_dx_d)
\]

that multiplies each component of the original transform by some arbitrary
coefficient, it would be difficult to derive an efficient kernel, even though our
coefficients are related to the magic number $\pi$ ☺. However, one can have
certain combinations of coefficients that would still make $K$ easy to compute.
For instance, set parameters $\gamma > 0$ and $\zeta > 0$, and consider the transform

\[
\Phi(\mathbf{x}) = \left(\zeta \cdot 1, \sqrt{2\gamma\zeta} \cdot x_1, \sqrt{2\gamma\zeta} \cdot x_2, \ldots, \sqrt{2\gamma\zeta} \cdot x_d,
\gamma \cdot x_1x_1, \gamma \cdot x_1x_2, \ldots, \gamma \cdot x_dx_d\right),
\]

then $K(\mathbf{x}, \mathbf{x}') = (\zeta + \gamma \mathbf{x}^{\mathrm{T}}\mathbf{x}')^2$, which is also easy to compute. The resulting
kernel $K$ is often called the second-order polynomial kernel.
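To see why this particular scaling works, it may help to expand the inner product term by term. The following is a short worked derivation, assuming the transform scales the constant feature by $\zeta$, each linear feature by $\sqrt{2\gamma\zeta}$, and each quadratic feature by $\gamma$:

```latex
\begin{aligned}
\Phi(\mathbf{x})^{\mathrm{T}} \Phi(\mathbf{x}')
  &= \zeta^2
   + 2\gamma\zeta \sum_{i=1}^{d} x_i x_i'
   + \gamma^2 \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j x_i' x_j' \\
  &= \zeta^2 + 2\gamma\zeta\, (\mathbf{x}^{\mathrm{T}}\mathbf{x}')
   + \gamma^2 (\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2 \\
  &= (\zeta + \gamma\, \mathbf{x}^{\mathrm{T}}\mathbf{x}')^2 .
\end{aligned}
```

The three groups of features reproduce exactly the three terms of the binomial expansion, which is why the coefficients cannot be chosen arbitrarily if the kernel is to stay this simple.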

At first glance, the freedom to choose $(\gamma, \zeta)$ and still get an efficiently-computable
kernel looks useful and also harmless. Multiplying each feature
in the $\mathcal{Z}$ space by different constants does not change the expressive power
of linear classifiers in the $\mathcal{Z}$ space. Nevertheless, changing $(\gamma, \zeta)$ changes the
geometry in the $\mathcal{Z}$ space, which affects distances and hence the margin of a
hyperplane. Thus, different $(\gamma, \zeta)$ could result in a different optimal hyperplane
in the $\mathcal{Z}$ space since the margins of all the hyperplanes have changed, and this
may give a different quadratic separator in the $\mathcal{X}$ space. The following figures
show what happens on some artificial data when you vary $\gamma$ with $\zeta$ fixed at 1.














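[Figure: three panels showing the quadratic separators found with the kernels $(1 + 0.001\,\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2$, $1 + (\mathbf{x}^{\mathrm{T}}\mathbf{x}') + (\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2$, and $(1 + 1000\,\mathbf{x}^{\mathrm{T}}\mathbf{x}')^2$.]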


























We see that the three quadratic curves are different, and so are their support
vectors denoted by the squares. It is difficult, even after some ‘forbidden’
visual snooping ☺, to decide which curve is better. One possibility is to
use the $E_{\mathrm{cv}}$ bound based on the number of support vectors, but keep in
mind that the $E_{\mathrm{cv}}$ bound can be rather loose in practice. Other possibilities
include using the other validation tools that we have introduced in Chapter 4
to choose $\gamma$ or other parameters in the kernel function. The choice of kernels
and kernel parameters is quite important for ensuring a good performance of
nonlinear SVM. Some simple guidelines for popular kernels will be discussed
in Section 8.3.2.

The derivation of the second-order polynomial kernel above can be extended to the
popular degree-$Q$ polynomial kernel

\[
K(\mathbf{x}, \mathbf{x}') = (\zeta + \gamma \mathbf{x}^{\mathrm{T}}\mathbf{x}')^Q,
\]

where $\gamma > 0$, $\zeta > 0$, and $Q \in \mathbb{N}$.⁵ Then, with
the kernel trick, hard-margin SVM can learn
sophisticated polynomial boundaries of different
orders by using exactly the same procedure
and just plugging in different polynomial
kernels. So, we can efficiently use very
high-dimensional kernels while at the same time implicitly controlling the model
complexity by maximizing the margin. The side figure shows a 10-th order
polynomial found by kernel-SVM with margin 0.1.














Gaussian-RBF Kernel. Another popular kernel is called the Gaussian-RBF
kernel,⁶ which has the form

\[
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2\right)
\]

for some $\gamma > 0$. Let us take $\gamma = 1$ and take $\mathbf{x}$ to be a scalar $x \in \mathbb{R}$ in order to
understand the transform $\Phi$ implied by the kernel. In this case,

\[
\begin{aligned}
K(x, x') &= \exp\left(-\|x - x'\|^2\right) \\
&= \exp\left(-(x)^2\right) \cdot \exp(2xx') \cdot \exp\left(-(x')^2\right) \\
&= \exp\left(-(x)^2\right) \cdot \left(\sum_{k=0}^{\infty} \frac{2^k (x)^k (x')^k}{k!}\right) \cdot \exp\left(-(x')^2\right),
\end{aligned}
\]

which is equivalent to an inner product in a feature space defined by the
nonlinear transform

\[
\Phi(x) = \exp(-x^2) \cdot \left(1, \sqrt{\tfrac{2^1}{1!}}\,x, \sqrt{\tfrac{2^2}{2!}}\,x^2, \sqrt{\tfrac{2^3}{3!}}\,x^3, \ldots\right).
\]

We got this nonlinear transform by splitting each term in the Taylor series of
$K(x, x')$ into identical terms involving $x$ and $x'$. Note that in this case, $\Phi$ is
an infinite-dimensional transform. Thus, a direct computation of $\Phi(\mathbf{x})^{\mathrm{T}}\Phi(\mathbf{x}')$
is not possible in this case. Nevertheless, with the kernel trick, hard-margin
SVM can find a hyperplane in the infinite dimensional $\mathcal{Z}$ space with model
complexity under control if the margin is large enough.

⁵ We stick to the notation $Q$ for the order of a polynomial, not to be confused with the
matrix $Q$ in quadratic programming.

⁶ RBF comes from the radial basis function introduced in Chapter 6. We use the parameter
$\gamma$ in place of the scale parameter $1/r$, which is common in the context of SVM.
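The Taylor-series argument for the scalar case with $\gamma = 1$ can be checked numerically by truncating the infinite transform after a finite number of coordinates. In the sketch below, `phi_truncated` and the sample inputs are illustrative; the truncated inner product approaches the kernel value as more coordinates are kept.

```python
import math


def rbf_kernel(x, xp):
    """Gaussian-RBF kernel with gamma = 1 for scalar inputs."""
    return math.exp(-(x - xp) ** 2)


def phi_truncated(x, terms):
    """First `terms` coordinates of the infinite transform:
    exp(-x^2) * sqrt(2^k / k!) * x^k for k = 0, 1, 2, ..."""
    return [math.exp(-x * x) * math.sqrt(2.0 ** k / math.factorial(k)) * x ** k
            for k in range(terms)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, xp = 0.7, -0.3
exact = rbf_kernel(x, xp)
for terms in (2, 5, 10, 20):
    approx = dot(phi_truncated(x, terms), phi_truncated(xp, terms))
    print(terms, approx, abs(approx - exact))
```

The kernel evaluates the limit of this sum with a single exponential, which is what makes the infinite-dimensional transform usable in practice.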

The parameter $\gamma$ controls the width of the Gaussian kernel. Different
choices for the width $\gamma$ correspond to different geometries in the infinite-dimensional
$\mathcal{Z}$ space, much like how different choices for $(\gamma, \zeta)$ in the polynomial
kernel correspond to different geometries in the polynomial $\mathcal{Z}$ space. The
following figures show the classification results of three Gaussian-RBF kernels
only differing in their choice of $\gamma$.

[Figure: three panels showing the SVM boundaries for the kernels $\exp(-1\|\mathbf{x} - \mathbf{x}'\|^2)$, $\exp(-10\|\mathbf{x} - \mathbf{x}'\|^2)$, and $\exp(-100\|\mathbf{x} - \mathbf{x}'\|^2)$.]

When the Gaussian functions are narrow (large $\gamma$), we clearly see that even
the protection of a large margin cannot suppress overfitting. However, for
a reasonably small $\gamma$, the sophisticated boundary discovered by SVM with
the Gaussian-RBF kernel looks quite good. Again, this demonstrates that
kernels and kernel parameters need to be carefully chosen to get reasonable
performance.


Exercise 8.15

Consider two finite-dimensional feature transforms $\Phi_1$ and $\Phi_2$ and their
corresponding kernels $K_1$ and $K_2$.

(a) Define $\Phi(\mathbf{x}) = (\Phi_1(\mathbf{x}), \Phi_2(\mathbf{x}))$. Express the corresponding kernel of
$\Phi$ in terms of $K_1$ and $K_2$.

(b) Consider the matrix $\Phi_1(\mathbf{x})\Phi_2(\mathbf{x})^{\mathrm{T}}$ and let $\Phi(\mathbf{x})$ be the vector
representation of the matrix (say, by concatenating all the rows). Express
the corresponding kernel of $\Phi$ in terms of $K_1$ and $K_2$.

(c) Hence, show that if $K_1$ and $K_2$ are kernels, then so are $K_1 + K_2$ and
$K_1K_2$.

The results above can be used to construct the general polynomial kernels
and (when extended to the infinite-dimensional transforms) to construct
the general Gaussian-RBF kernels.


8.3.2 Choice of Kernels


Three kernels are popular in practice: linear, polynomial and Gaussian-RBF.
The linear kernel, which corresponds to a special polynomial kernel with $Q = 1$,
$\gamma = 1$, and $\zeta = 0$, corresponds to the identity transform. Solving the SVM
dual problem (8.22) with the linear kernel is equivalent to solving the linear
hard-margin SVM (8.4). Many special SVM packages utilize the equivalence
to find the optimal $(b^*, \mathbf{w}^*)$ or $\boldsymbol{\alpha}^*$ more efficiently. One particular advantage of
the linear hard-margin SVM is that the value of $w_i^*$ can carry some explanation
on how the prediction $g(\mathbf{x})$ is made, just like our familiar perceptron. One
particular disadvantage of linear hard-margin SVM is the inability to produce
a sophisticated boundary, which may be needed if the data is not linearly
separable.

The polynomial kernel provides two controls of complexity: an explicit control
from the degree $Q$ and an implicit control from the large-margin concept.
The kernel can perform well with a suitable choice of $(\gamma, \zeta)$ and $Q$. Nevertheless,
choosing a good combination of three parameters is not an easy task.
In addition, when $Q$ is large, the polynomial kernel evaluates to a value with
either very big or very small magnitude. The large range of values introduces
numerical difficulty when solving the dual problem. Thus, the polynomial
kernel is typically used only with degree $Q \le 10$, and even then, only when
$\zeta + \gamma \mathbf{x}^{\mathrm{T}}\mathbf{x}'$ is scaled to reasonable values by appropriately choosing $(\zeta, \gamma)$.

The Gaussian-RBF kernel can lead to a sophisticated boundary for SVM
while controlling the model complexity using the large-margin concept. One
only needs to specify one parameter, the width $\gamma$, and its numerical range is
universally chosen in the interval [0,1]. This often makes it preferable to the
polynomial and linear kernels. On the down side, because the corresponding
transform of the Gaussian-RBF kernel is an infinite-dimensional one, the resulting
hypothesis can only be expressed by the support vectors rather than
the actual hyperplane, making it difficult to interpret the prediction $g(\mathbf{x})$. Interestingly,
when coupling SVM with the Gaussian-RBF kernel, the hypothesis
contains a linear combination of Gaussian functions centered at $\mathbf{x}_n$, which can
be viewed as a special case of the RBF Network introduced in e-Chapter 6.

Figure 8.9: (a) linear classifier; (b) Gaussian-RBF kernel. For (a) a noisy data set on which a linear classifier appears to work
quite well, (b) using the Gaussian-RBF kernel with the hard-margin SVM
leads to overfitting.

In addition to the above three kernels, there are tons of other kernel choices.
One can even design new kernels that better suit the learning task at hand.
Note, however, that the kernel is defined as a shortcut function for computing
inner products. Thus, similar to what’s shown in Exercise 8.11, the Gram
matrix defined by

\[
\mathrm{K} = \begin{bmatrix}
K(\mathbf{x}_1, \mathbf{x}_1) & K(\mathbf{x}_1, \mathbf{x}_2) & \cdots & K(\mathbf{x}_1, \mathbf{x}_N) \\
K(\mathbf{x}_2, \mathbf{x}_1) & K(\mathbf{x}_2, \mathbf{x}_2) & \cdots & K(\mathbf{x}_2, \mathbf{x}_N) \\
\vdots & \vdots & & \vdots \\
K(\mathbf{x}_N, \mathbf{x}_1) & K(\mathbf{x}_N, \mathbf{x}_2) & \cdots & K(\mathbf{x}_N, \mathbf{x}_N)
\end{bmatrix}
\]

must be positive semi-definite. It turns out that this condition is not only necessary
but also sufficient for $K$ to be a valid kernel function that corresponds
to the inner product in some nonlinear $\mathcal{Z}$ space. This requirement is formally
known as Mercer’s condition.

$K(\mathbf{x}, \mathbf{x}')$ is a valid kernel function if and only if
the kernel matrix $\mathrm{K}$ is always symmetric PSD for any
given $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.

The condition is often used to rule out invalid kernel functions. On the other
hand, proving a kernel function to be a valid one is a non-trivial task, even
when using Mercer’s condition.
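As an illustration of how Mercer’s condition rules out a candidate, the sketch below evaluates the quadratic form $\mathbf{z}^{\mathrm{T}}\mathrm{K}\mathbf{z}$ on a two-point Gram matrix. The candidate `neg_sq_dist`, the points, and the witness vector `z` are hypothetical choices made for this example. A single negative value of the quadratic form is enough to show that the Gram matrix is not PSD.

```python
def gram(points, kernel):
    """Gram matrix K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]


def quadratic_form(K, z):
    """z^T K z; a negative value proves that K is not positive semi-definite."""
    n = len(z)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))


def neg_sq_dist(x, xp):
    """Candidate 'kernel' K(x, x') = -||x - x'||^2 (hypothetical, for illustration)."""
    return -sum((a - b) ** 2 for a, b in zip(x, xp))


def linear(x, xp):
    return sum(a * b for a, b in zip(x, xp))


points = [(0.0, 0.0), (1.0, 0.0)]
z = [1.0, 1.0]

print(quadratic_form(gram(points, neg_sq_dist), z))   # negative: not a valid kernel
print(quadratic_form(gram(points, linear), z))        # non-negative for this z
```

The asymmetry noted in the text shows up here too: one bad pair (points, `z`) disproves validity, whereas a non-negative result for one `z` proves nothing, since PSD must hold for every `z` and every finite set of points.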


8.4 Soft-margin SVM 


The hard-margin SVM assumes that the data is separable in the $\mathcal{Z}$ space.
When we transform $\mathbf{x}$ to a high-dimensional $\mathcal{Z}$ space, or an infinite-dimensional
one using the Gaussian-RBF kernel for instance, we can easily overfit the data.⁷
Figure 8.9(b) depicts this situation. The hard-margin SVM coupled with the
Gaussian-RBF kernel insists on fitting the two ‘outliers’ that are misclassified
by the linear classifier, and results in an unnecessarily complicated decision
boundary that should be ringing your ‘overfitting alarm bells’.

For the data in Figure 8.9(b), we should use a simple linear hyperplane
rather than the complex Gaussian-RBF separator. This means that we will
need to accept some errors, and the hard-margin SVM cannot accommodate
that since it is designed for perfect classification of separable data. One remedy
is to consider a ‘soft’ formulation: try to get a large-margin hyperplane, but
allow small violation of the margins or even some classification errors. As we
will see, this formulation controls the SVM model’s sensitivity to outliers.

⁷ It can be shown that the Gaussian-RBF kernel can separate any data set (with no
repeated points $\mathbf{x}_n$).

© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:8-40

The most common formulation of the (linear) soft-margin SVM is as follows.
Introduce an ‘amount’ of margin violation $\xi_n \ge 0$ for each data point
$(\mathbf{x}_n, y_n)$ and require that

\[
y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1 - \xi_n.
\]

According to our definition of separation in (8.2), $(\mathbf{x}_n, y_n)$ is separated if
$y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1$, so $\xi_n$ captures by how much $(\mathbf{x}_n, y_n)$ fails to be separated.
In terms of margin, recall that if $y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) = 1$ in the hard-margin
SVM, then the data point is on the boundary of the margin. So, $\xi_n$ captures
how far into the margin the data point can go. The margin is not hard
anymore, it is ‘soft’, allowing a data point to penetrate it. Note that if a
point $(\mathbf{x}_n, y_n)$ satisfies $y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) > 1$, then the margin is not violated and
the corresponding $\xi_n$ is defined to be zero. Ideally, we would like the total
sum of margin violations to be small, so we modify the hard-margin SVM
to the soft-margin SVM by allowing margin violations but adding a penalty
term to discourage large violations. The result is the soft-margin optimization
problem:

\[
\begin{aligned}
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad & \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C \sum_{n=1}^{N} \xi_n \qquad (8.30) \\
\text{subject to} \quad & y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1 - \xi_n \quad \text{for } n = 1, 2, \ldots, N; \\
& \xi_n \ge 0 \quad \text{for } n = 1, 2, \ldots, N.
\end{aligned}
\]


We solve this optimization problem to obtain $(b, \mathbf{w})$. Compared with (8.4),
the new terms are the violations $\xi_n$ in the constraints and the penalty term
$C\sum_{n=1}^{N}\xi_n$ in the objective. We get a large margin by minimizing the
term $\frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}$ in the objective function. We get small violations by minimizing
the term $\sum_{n=1}^{N}\xi_n$. By minimizing the sum of these two terms, we get a
compromise between our two goals, and that compromise will favor one or the
other of our goals (large margin versus small margin violations) depending on
the user-defined penalty parameter denoted by $C$. When $C$ is large, it means
we care more about violating the margin, which gets us closer to the hard-margin
SVM. When $C$ is small, on the other hand, we care less about violating
the margin. By choosing $C$ appropriately, we get a large-margin hyperplane
(small effective $d_{\mathrm{vc}}$) with a small amount of margin violations.

The new optimization problem in (8.30) looks more complex than (8.4).
Don’t panic. We can solve this problem using quadratic programming, just as
we did with the hard-margin SVM (see Exercise 8.16 for the explicit solution).
Figure 8.10 shows the result of solving (8.30) using the data from Figure 8.6(a).
When $C$ is small, we obtain a classifier with very large margin but with many
cases of margin violations (some cross the boundary and some don’t); when
$C$ is large, we obtain a classifier with smaller margin but with fewer margin
violations.

Figure 8.10: The linear soft-margin optimal hyperplane algorithm from
Exercise 8.16 on the non-separable data in Figure 8.6(a). Panel (b): $C = 500$.


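Exercise 8.16
Show that the optimization problem in (8.30) is a QP-problem.

(a) Show that the optimization variable is $\mathbf{u} = \begin{bmatrix} b \\ \mathbf{w} \\ \boldsymbol{\xi} \end{bmatrix}$, where $\boldsymbol{\xi} = \begin{bmatrix} \xi_1 \\ \vdots \\ \xi_N \end{bmatrix}$.

(b) Show that $\mathbf{u}^* \leftarrow \mathrm{QP}(Q, \mathbf{p}, A, \mathbf{c})$, where

\[
Q = \begin{bmatrix} 0 & \mathbf{0}_d^{\mathrm{T}} & \mathbf{0}_N^{\mathrm{T}} \\ \mathbf{0}_d & I_d & 0_{d \times N} \\ \mathbf{0}_N & 0_{N \times d} & 0_{N \times N} \end{bmatrix}, \quad
\mathbf{p} = \begin{bmatrix} \mathbf{0}_{d+1} \\ C \cdot \mathbf{1}_N \end{bmatrix}, \quad
A = \begin{bmatrix} \mathrm{YX} & I_N \\ 0_{N \times (d+1)} & I_N \end{bmatrix}, \quad
\mathbf{c} = \begin{bmatrix} \mathbf{1}_N \\ \mathbf{0}_N \end{bmatrix},
\]

and YX is the signed data matrix from Exercise 8.4.

(c) How do you recover $b^*$, $\mathbf{w}^*$ and $\boldsymbol{\xi}^*$ from $\mathbf{u}^*$?

(d) How do you determine which data points violate the margin, which
data points are on the edge of the margin and which data points are
correctly separated and outside the margin?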


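Similar to the hard-margin optimization problem (8.4), the soft-margin version
(8.30) is a convex QP-problem. The corresponding dual problem can be
derived using the same technique introduced in Section 8.2. We will just provide
the very first step here, and let you finish the rest. For the soft-margin
SVM (8.30), the Lagrange function is

\[
\mathcal{L}(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})
= \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \left(1 - \xi_n - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b)\right) - \sum_{n=1}^{N} \beta_n \xi_n,
\]

where $\alpha_n \ge 0$ are the Lagrange multipliers for $y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1 - \xi_n$, and
$\beta_n \ge 0$ are the Lagrange multipliers for $\xi_n \ge 0$. Then, the KKT condition
tells us that at the optimal solution, $\partial\mathcal{L}/\partial\xi_n$ has to be 0. This means

\[
C - \alpha_n - \beta_n = 0.
\]

That is, $\alpha_n$ and $\beta_n$ sum to $C$. We can then replace all $\beta_n$ by $C - \alpha_n$ without
loss of optimality. If we do so, the Lagrange dual problem simplifies to

\[
\max_{\substack{\boldsymbol{\alpha} \ge 0,\ \boldsymbol{\beta} \ge 0, \\ \alpha_n + \beta_n = C}} \ \min_{b, \mathbf{w}, \boldsymbol{\xi}} \ \mathcal{L}(b, \mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}),
\]

where

\[
\begin{aligned}
\mathcal{L} &= \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C \sum_{n=1}^{N} \xi_n + \sum_{n=1}^{N} \alpha_n \left(1 - \xi_n - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b)\right) - \sum_{n=1}^{N} (C - \alpha_n)\xi_n \\
&= \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \sum_{n=1}^{N} \alpha_n \left(1 - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b)\right).
\end{aligned}
\]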

Does the simplified objective function now look familiar? It is just the objective
function that led us to the dual of the hard-margin SVM! ☺ We trust
that you can then go through all the steps as in Section 8.2 to get the whole
dual. When expressed in matrix form, the dual problem of (8.30) is

\[
\begin{aligned}
\min_{\boldsymbol{\alpha}} \quad & \frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}} Q_D \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{T}}\boldsymbol{\alpha} \\
\text{subject to} \quad & \mathbf{y}^{\mathrm{T}}\boldsymbol{\alpha} = 0; \\
& \mathbf{0} \le \boldsymbol{\alpha} \le C \cdot \mathbf{1}.
\end{aligned}
\]

Interestingly, compared with (8.22), the only change of the dual problem is
that each $\alpha_n$ is now upper-bounded by $C$, the penalty rate, instead of $\infty$.
The formulation can again be solved by some general or specifically tailored
quadratic programming packages.
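For reference, the remaining steps can be sketched as follows. This is an outline under the assumption that the derivation mirrors the hard-margin case of Section 8.2, starting from the simplified Lagrangian $\mathcal{L} = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \sum_n \alpha_n(1 - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b))$:

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial b} = 0
  &\;\Longrightarrow\; \sum_{n=1}^{N} \alpha_n y_n = 0, \\
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{0}
  &\;\Longrightarrow\; \mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n, \\
\text{substituting back:}\quad
\mathcal{L}
  &= -\frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \alpha_n \alpha_m \mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m
     + \sum_{n=1}^{N} \alpha_n, \\
\beta_n = C - \alpha_n \ge 0
  &\;\Longrightarrow\; \alpha_n \le C .
\end{aligned}
```

Maximizing the substituted expression over $\boldsymbol{\alpha}$ is the same as minimizing $\frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}} Q_D \boldsymbol{\alpha} - \mathbf{1}^{\mathrm{T}}\boldsymbol{\alpha}$, and the last line is where the upper bound $C$ on each $\alpha_n$ comes from.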

For the soft-margin SVM, we get a similar expression for the final hypothesis
in terms of the optimal dual solution, like that of the hard-margin SVM.

\[
g(\mathbf{x}) = \mathrm{sign}\left(\sum_{\alpha_n^* > 0} y_n \alpha_n^* \mathbf{x}_n^{\mathrm{T}} \mathbf{x} + b^*\right).
\]

The kernel trick is still applicable as long as we can calculate the optimal $b^*$
efficiently. Getting the optimal $b^*$ from the optimal $\boldsymbol{\alpha}^*$, however, is slightly
more complicated in the soft-margin case. The KKT conditions state that

\[
\begin{aligned}
\alpha_n^* \cdot \left(y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) - 1 + \xi_n^*\right) &= 0; \\
\beta_n^* \cdot \xi_n^* = (C - \alpha_n^*) \cdot \xi_n^* &= 0.
\end{aligned}
\]

If $\alpha_n^* > 0$, then $y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) - 1 + \xi_n^* = 0$ and hence

\[
y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) = 1 - \xi_n^* \le 1.
\]

On the other hand, if $\alpha_n^* < C$, then $\xi_n^* = 0$ and hence

\[
y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) \ge 1.
\]

The two inequalities give a range for the optimal $b^*$. When there is a support
vector with $0 < \alpha_n^* < C$, we see that the inequalities can be combined to an
equality $y_n(\mathbf{w}^{*\mathrm{T}}\mathbf{x}_n + b^*) = 1$, which can be used to pin down the optimal $b^*$
as we did for the hard-margin SVM. In other cases, there are many choices of
$b^*$ within the range allowed by the two inequalities, and you can freely pick any of them.

The support vectors with $0 < \alpha_n^* < C$ are called the free support vectors,
which are guaranteed to be on the boundary of the fat-hyperplane and hence
also called margin support vectors. The support vectors with $\alpha_n^* = C$, on the
other hand, are called the bounded support vectors (also called non-margin
support vectors). They can be on the fat boundary, slightly violating the
boundary but still correctly predicted, or seriously violating the boundary
and erroneously predicted.
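The distinction between free and bounded support vectors, and its use for pinning down $b^*$, can be sketched as follows. The one-dimensional data set and the dual solution `alpha` below are hypothetical: they were worked out by hand from the KKT conditions for $C = 0.5$ (with $w^* = 0.5$, $b^* = 0$) purely for illustration, and the helper names are not from the text.

```python
def classify_support_vectors(alpha, C, tol=1e-8):
    """Split indices into free (0 < alpha_n < C) and bounded (alpha_n = C)."""
    free = [n for n, a in enumerate(alpha) if tol < a < C - tol]
    bounded = [n for n, a in enumerate(alpha) if a >= C - tol]
    return free, bounded


def bias_from_free_sv(X, y, alpha, C, kernel, tol=1e-8):
    """b* = y_s - sum_n y_n alpha_n K(x_n, x_s) for a free support vector s."""
    free, _ = classify_support_vectors(alpha, C, tol)
    if not free:
        raise ValueError("no free support vector; b* is only constrained to a range")
    s = free[0]
    return y[s] - sum(y[n] * alpha[n] * kernel(X[n], X[s])
                      for n in range(len(X)) if alpha[n] > tol)


def linear_kernel(x, xp):
    return sum(a * b for a, b in zip(x, xp))


# Hypothetical 1-D data: two well-placed points and two points on the wrong side.
X = [(-2.0,), (2.0,), (-1.0,), (1.0,)]
y = [-1, 1, 1, -1]
C = 0.5
alpha = [0.375, 0.375, 0.5, 0.5]   # hand-derived dual solution for this data and C

free, bounded = classify_support_vectors(alpha, C)
b = bias_from_free_sv(X, y, alpha, C, linear_kernel)
print(free, bounded, b)
```

A tolerance is used when comparing with 0 and $C$ because a numerical QP-solver rarely returns these boundary values exactly.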

For separable data (in the $\Phi$-transformed space), there exists some $C'$
such that whenever $C \ge C'$, the soft-margin SVM produces exactly the same
solution as the hard-margin SVM. Thus, the hard-margin SVM can be viewed
as a special case of the soft-margin one, illustrated below.

[Figure: soft-margin SVM with small $C$, medium $C$, and large $C$; the large-$C$ panel coincides with the hard-margin SVM.]

Let’s take a closer look at the parameter $C$ in the soft-margin SVM formulation
(8.30). If $C$ is large, it means that we do want all violations (errors) $\xi_n$
to be as small as possible with the possible trade-off being a smaller margin
(higher complexity). If $C$ is small, we will tolerate some amounts of errors,
while possibly getting a less complicated hypothesis with large margin. Does
the trade-off sound familiar? We have encountered such a trade-off in Chapter 4
when we studied regularization. Let

\[
E_{\mathrm{SVM}}(b, \mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \max\left(1 - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b), 0\right).
\]

The $n$-th term in $E_{\mathrm{SVM}}(b, \mathbf{w})$ evaluates to 0 if there is no violation from the
$n$-th example, so $y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b) \ge 1$; otherwise, the $n$-th term is the amount
of violation for the corresponding data point. Therefore, the objective that
we minimize in soft-margin SVM (8.30) can be re-written as the following
optimization problem

\[
\min_{b, \mathbf{w}} \ \lambda \mathbf{w}^{\mathrm{T}}\mathbf{w} + E_{\mathrm{SVM}}(b, \mathbf{w}),
\]

subject to the constraints, and where $\lambda = 1/(2CN)$. In other words, soft-margin
SVM can be viewed as a special case of regularized classification with
$E_{\mathrm{SVM}}(b, \mathbf{w})$ as a surrogate for the in-sample error and $\frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}$ (without $b$) as
the regularizer. The $E_{\mathrm{SVM}}(b, \mathbf{w})$ term is an upper bound on the classification
in-sample error $E_{\mathrm{in}}$, while the regularizer term comes from the large-margin
concept and controls the effective model complexity.
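The relation between the soft-margin objective and the regularized form can be checked on a small case. The sketch below uses a hypothetical one-dimensional data set and a fixed hyperplane $(b, \mathbf{w})$, takes each $\xi_n$ at its smallest feasible value $\max(1 - y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b), 0)$, and confirms that dividing the objective of (8.30) by $CN$ gives $\lambda\mathbf{w}^{\mathrm{T}}\mathbf{w} + E_{\mathrm{SVM}}$ with $\lambda = 1/(2CN)$.

```python
def signal(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def e_svm(w, b, X, y):
    """E_SVM(b, w) = (1/N) * sum_n max(1 - y_n (w.x_n + b), 0)."""
    return sum(max(1.0 - yn * signal(w, b, xn), 0.0) for xn, yn in zip(X, y)) / len(X)


def e_in(w, b, X, y):
    """Classification (0/1) in-sample error; a zero signal counts as an error."""
    return sum(1 for xn, yn in zip(X, y) if yn * signal(w, b, xn) <= 0) / len(X)


def soft_margin_objective(w, b, X, y, C):
    """(1/2) w.w + C * sum_n xi_n, with each xi_n at its smallest feasible value."""
    xi = [max(1.0 - yn * signal(w, b, xn), 0.0) for xn, yn in zip(X, y)]
    return 0.5 * sum(wi * wi for wi in w) + C * sum(xi)


# Hypothetical 1-D data and hyperplane, chosen so the numbers are easy to verify.
X = [(-2.0,), (2.0,), (-1.0,), (1.0,)]
y = [-1, 1, 1, -1]
w, b, C = [0.5], 0.0, 0.5
N = len(X)
lam = 1.0 / (2 * C * N)

lhs = soft_margin_objective(w, b, X, y, C) / (C * N)
rhs = lam * sum(wi * wi for wi in w) + e_svm(w, b, X, y)
print(e_in(w, b, X, y), e_svm(w, b, X, y), lhs, rhs)
```

On this example the hinge-based error is larger than the 0/1 error, in line with the upper-bound claim: every misclassified point has a violation of at least 1.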


Exercise 8.17

Show that $E_{\mathrm{SVM}}(b, \mathbf{w})$ is an upper bound on the $E_{\mathrm{in}}(b, \mathbf{w})$, where $E_{\mathrm{in}}$ is
the classification 0/1 error.


In summary, soft-margin SVM can: 


1. Deliver a large-margin hyperplane, and in so doing it can control the 
effective model complexity. 


2. Deal with high- or infinite-dimensional transforms using the kernel trick.


3. Express the final hypothesis g(x) using only a few support vectors, their 
corresponding Lagrange multipliers, and the kernel. 


4. Control the sensitivity to outliers and regularize the solution through 
setting C appropriately. 

When the regularization parameter $C$ and the kernel are chosen properly,
the soft-margin SVM is often observed to enjoy a low $E_{\mathrm{out}}$ with the useful
properties above. These properties make the soft-margin SVM (the SVM for
short) one of the most useful classification models and often the first choice
in learning from data. It is a robust linear model with advanced nonlinear
transform capability when used with a kernel.


8.5 Problems


Problem 8.1 Consider a data set with two data points $\mathbf{x}_{\pm} \in \mathbb{R}^d$ having
class $\pm 1$ respectively. Manually solve (8.4) by explicitly minimizing $\|\mathbf{w}\|^2$
subject to the two separation constraints.

Compute the optimal (maximum margin) hyperplane $(b^*, \mathbf{w}^*)$ and its margin.

Compare with your solution to Exercise 8.1.


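Problem 8.2 Consider a data set with three data points in $\mathbb{R}^2$:

\[
\mathrm{X} = \begin{bmatrix} 0 & 0 \\ 0 & -1 \\ -2 & 0 \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \end{bmatrix}.
\]

Manually solve (8.4) to get the optimal hyperplane $(b^*, \mathbf{w}^*)$ and its margin.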


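Problem 8.3 Manually solve the dual optimization from Example 8.8 to
obtain the same $\boldsymbol{\alpha}^*$ that was obtained in the text using a QP-solver. Use the
following steps.

(a) Show that the dual optimization problem is to minimize

\[
\mathcal{L}(\boldsymbol{\alpha}) = 4\alpha_2^2 + 2\alpha_3^2 + \tfrac{9}{2}\alpha_4^2 - 4\alpha_2\alpha_3 - 6\alpha_2\alpha_4 + 6\alpha_3\alpha_4 - \alpha_1 - \alpha_2 - \alpha_3 - \alpha_4,
\]

subject to the constraints

\[
\alpha_1 + \alpha_2 = \alpha_3 + \alpha_4; \qquad \alpha_1, \alpha_2, \alpha_3, \alpha_4 \ge 0.
\]

(b) Use the equality constraint to replace $\alpha_1$ in $\mathcal{L}(\boldsymbol{\alpha})$ to get

\[
\mathcal{L}(\boldsymbol{\alpha}) = 4\alpha_2^2 + 2\alpha_3^2 + \tfrac{9}{2}\alpha_4^2 - 4\alpha_2\alpha_3 - 6\alpha_2\alpha_4 + 6\alpha_3\alpha_4 - 2\alpha_3 - 2\alpha_4.
\]

(c) Fix $\alpha_3, \alpha_4 \ge 0$ and minimize $\mathcal{L}(\boldsymbol{\alpha})$ in (b) with respect to $\alpha_2$ to show that

\[
\alpha_2 = \frac{\alpha_3}{2} + \frac{3\alpha_4}{4} \quad \text{and} \quad \alpha_1 = \alpha_3 + \alpha_4 - \alpha_2 = \frac{\alpha_3}{2} + \frac{\alpha_4}{4}.
\]

Are these valid solutions for $\alpha_1, \alpha_2$?

(d) Use the expressions in (c) to reduce the problem to minimizing

\[
\mathcal{L}(\boldsymbol{\alpha}) = \alpha_3^2 + \tfrac{9}{4}\alpha_4^2 + 3\alpha_3\alpha_4 - 2\alpha_3 - 2\alpha_4,
\]

subject to $\alpha_3, \alpha_4 \ge 0$. Show that the minimum is attained when $\alpha_3 = 1$
and $\alpha_4 = 0$. What are $\alpha_1, \alpha_2$?

It’s a relief to have QP-solvers for solving such problems in the general case!


Problem 8.4 Set up the dual problem for the toy data set in Exercise 8.2.
Then, solve the dual problem and compute $\boldsymbol{\alpha}^*$, the optimal Lagrange
multipliers.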


Problem 8.5 [Bias and Variance of the Optimal Hyperplane]
In this problem, you are to investigate the bias and variance of the optimal
hyperplane in a simple setting. The input is $(x_1, x_2) \in [-1, 1]^2$ and the target
function is $f(\mathbf{x}) = \mathrm{sign}(x_2)$.

The hypothesis set $\mathcal{H}$ contains horizontal linear separators $h(\mathbf{x}) = \mathrm{sign}(x_2 - a)$,
where $-1 \le a \le 1$. Consider two algorithms:

Random: Pick a random separator from $\mathcal{H}$.

SVM: Pick the maximum margin separator from $\mathcal{H}$.

(a) Generate 3 data points uniformly in the upper half of the input-space and
3 data points in the lower half, and obtain $g_{\mathrm{random}}$ and $g_{\mathrm{SVM}}$.

(b) Create a plot of your data, and your two hypotheses.

(c) Repeat part (a) for a million data sets to obtain one million Random and
SVM hypotheses.

(d) Give a histogram of the values of $a_{\mathrm{random}}$ resulting from the random
algorithm and another histogram of $a_{\mathrm{SVM}}$ resulting from the optimal separators.
Compare the two histograms and explain the differences.

(e) Estimate the bias and var for the two algorithms. Explain your findings,
in particular which algorithm is better for this toy problem.


Problem 8.6 Show that $\sum_{n=1}^{N} \|\mathbf{x}_n - \boldsymbol{\mu}\|^2$ is minimized at $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$.
n=1 


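Problem 8.7 For any $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with $\|\mathbf{x}_n\| \le R$ and $N$ even, show
that there exists a balanced dichotomy $y_1, \ldots, y_N$ that satisfies

\[
\sum_{n=1}^{N} y_n = 0, \quad \text{and} \quad \left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\| \le \frac{NR}{\sqrt{N-1}}.
\]

(This is the geometric lemma that is needed to bound the VC-dimension of $\rho$-fat
hyperplanes by $\lceil R^2/\rho^2 \rceil + 1$.) The following steps are a guide for the proof.

Suppose you randomly select $N/2$ of the labels $y_1, \ldots, y_N$ to be $+1$, the others
being $-1$. By construction, $\sum_{n=1}^{N} y_n = 0$.

(a) Show
\[
\left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\|^2 = \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m \mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m.
\]

(b) When $n = m$, what is $y_n y_m$? Show that $\mathbb{P}[y_n y_m = 1] = (\frac{N}{2} - 1)/(N - 1)$
when $n \ne m$. Hence show that
\[
\mathbb{E}[y_n y_m] = \begin{cases} 1 & m = n; \\ -\dfrac{1}{N-1} & m \ne n. \end{cases}
\]

(c) Show that
\[
\mathbb{E}\left[\left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\|^2\right] = \frac{N}{N-1} \sum_{n=1}^{N} \|\mathbf{x}_n - \bar{\mathbf{x}}\|^2,
\]
where the average vector $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n$. [Hint: Use linearity of expectation
in (a), and consider the cases $m = n$ and $m \ne n$ separately.]

(d) Show that $\sum_{n=1}^{N} \|\mathbf{x}_n - \bar{\mathbf{x}}\|^2 \le \sum_{n=1}^{N} \|\mathbf{x}_n\|^2 \le NR^2$. [Hint: Problem 8.6.]

(e) Conclude that
\[
\mathbb{E}\left[\left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\|^2\right] \le \frac{N^2R^2}{N-1},
\]
and hence that
\[
\mathbb{P}\left[\left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\| \le \frac{NR}{\sqrt{N-1}}\right] > 0.
\]
This means for some choice of $y_n$, $\left\|\sum_{n=1}^{N} y_n \mathbf{x}_n\right\| \le NR/\sqrt{N-1}$.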

This proof is called a probabilistic existence proof: if some random process can 
generate an object with positive probability, then that object must exist. Note 
that you prove existence of the required dichotomy without actually construct- 
ing it. In this case, the easiest way to construct a desired dichotomy is to 
randomly generate the balanced dichotomies until you have one that works. 
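
The constructive procedure just described can be sketched as follows. This is an illustration under stated assumptions, not part of the problem: the function name `find_balanced_dichotomy`, the `max_tries` cap and the Gaussian demo data are hypothetical, and $R$ is taken to be the largest norm in the data.

```python
import numpy as np

def find_balanced_dichotomy(X, R, seed=0, max_tries=10_000):
    """Sample balanced labelings until ||sum_n y_n x_n|| <= N R / sqrt(N - 1)."""
    N = X.shape[0]
    if N % 2 != 0:
        raise ValueError("N must be even for a balanced dichotomy")
    rng = np.random.default_rng(seed)
    bound = N * R / np.sqrt(N - 1)
    base = np.array([1] * (N // 2) + [-1] * (N // 2))
    for _ in range(max_tries):
        y = rng.permutation(base)            # exactly N/2 of the labels are +1
        if np.linalg.norm(y @ X) <= bound:   # y @ X equals sum_n y_n x_n
            return y
    raise RuntimeError("no dichotomy found within max_tries")

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(10, 3))                    # hypothetical data set
    R = float(np.linalg.norm(X, axis=1).max())      # smallest valid radius
    y = find_balanced_dichotomy(X, R)
    print(y, np.linalg.norm(y @ X), 10 * R / 3.0)
```

The existence proof only guarantees that the loop can succeed; the cap on the number of tries is a practical safeguard rather than something the argument provides.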


Problem 8.8 We showed that if $N$ points in the ball of radius $R$ are 
shattered by hyperplanes with margin $\rho$, then $N \le R^2/\rho^2 + 1$ when $N$ is even. 

Now consider $N$ odd, and $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with $\|\mathbf{x}_n\| \le R$ shattered by hyperplanes 
with margin $\rho$. Recall that $(\mathbf{w}, b)$ implements $y_1, \ldots, y_N$ with margin $\rho$ if 

$$\rho\|\mathbf{w}\| \le y_n(\mathbf{w}^{\mathrm{T}}\mathbf{x}_n + b), \quad \text{for } n = 1, \ldots, N. \qquad (8.31)$$

Show that for $N = 2k + 1$ (odd), $N \le R^2/\rho^2 + \frac{1}{N} + 1$ as follows: 

Consider random labelings $y_1, \ldots, y_N$ of the $N$ points in which $k$ of the labels 
are $+1$ and $k + 1$ are $-1$. Define $\ell_n = \frac{1}{k}$ if $y_n = +1$ and $\ell_n = \frac{1}{k+1}$ 
if $y_n = -1$. 


(a) For any labeling with $k$ labels being $+1$, show, by summing (8.31) and 
using the Cauchy-Schwarz inequality, that 

$$2\rho \le \left\| \sum_{n=1}^{N} \ell_n y_n \mathbf{x}_n \right\|.$$

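(b) Show that there exists a labeling, with $k$ labels being $+1$, for which 

$$\left\| \sum_{n=1}^{N} \ell_n y_n \mathbf{x}_n \right\| \le \frac{2NR}{(N-1)\sqrt{N+1}}.$$

(i) Show $\left\| \sum_{n=1}^{N} \ell_n y_n \mathbf{x}_n \right\|^2 = \sum_{n=1}^{N} \sum_{m=1}^{N} \ell_n \ell_m y_n y_m \mathbf{x}_n^{\mathrm{T}} \mathbf{x}_m$. 

(ii) For $m = n$, show $E[\ell_n \ell_m y_n y_m] = \dfrac{1}{k(k+1)}$. 

(iii) For $m \ne n$, show $E[\ell_n \ell_m y_n y_m] = -\dfrac{1}{(N-1)k(k+1)}$. 
[Hint: $P[\ell_n \ell_m y_n y_m = 1/k^2] = k(k-1)/N(N-1)$.] 

(iv) Show $E\left[ \left\| \sum_{n=1}^{N} \ell_n y_n \mathbf{x}_n \right\|^2 \right] = \dfrac{N}{(N-1)k(k+1)} \sum_{n=1}^{N} \|\mathbf{x}_n - \bar{\mathbf{x}}\|^2$. 

(v) Use Problem 8.6 to conclude the proof as in Problem 8.7. 

(c) Use (a) and (b) to show that $N \le \dfrac{R^2}{\rho^2} + \dfrac{1}{N} + 1$. 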

Problem 8.9 Prove that for the separable case, if you remove a data 
point that is not a support vector, then the maximum margin classifier does not 
change. You may use the following steps as a guide. Let $g$ be the maximum 
margin classifier for all the data, and $g^{-}$ the maximum margin classifier after 
removal of a data point that is not a support vector. 

(a) Show that $g$ is a separator for $\mathcal{D}^{-}$, the data minus the non-support vector. 

(b) Show that the maximum margin classifier is unique. 

(c) Show that if $g^{-}$ has larger margin than $g$ on $\mathcal{D}^{-}$, then it also has larger 
margin on $\mathcal{D}$, a contradiction. Hence, conclude that $g$ is the maximum 
margin separator for $\mathcal{D}^{-}$. 


Problem 8.10 An essential support vector is one whose removal from 
the data set changes the maximum margin separator. For the separable case, 
show that there are at most $d+1$ essential support vectors. Hence, show that 
for the separable case, 

$$E_{\text{cv}} \le \frac{d+1}{N}.$$

Problem 8.11 Consider the version of the PLA that uses the misclassified 
data point $\mathbf{x}_n$ with lowest index $n$ for the weight update. Assume the data is 
separable and given in some fixed but arbitrary order, and when you remove a 
data point, you do not alter this order. In this problem, prove that 

$$E_{\text{cv}}(\text{PLA}) \le \frac{R^2}{N\rho^2},$$

where $\rho$ is the margin (half the width) of the maximum margin separating 
hyperplane that would (for example) be returned by the SVM. The following 
steps are a guide for the proof. 


(a) Use the result in Problem 1.3 to show that the number of updates $T$ that 
the PLA makes is at most $T \le R^2/\rho^2$. 

(b) Argue that this means that PLA only 'visits' at most $R^2/\rho^2$ different 
points during the course of its iterations. 


(c) Argue that after leaving out any point $(\mathbf{x}_n, y_n)$ that is not 'visited', PLA 
will return the same classifier. 

(d) What is the leave-one-out error $e_n$ for these points that were not visited? 
Hence, prove the desired bound on $E_{\text{cv}}(\text{PLA})$. 
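
The lowest-index PLA and the notion of a 'visited' point can be made concrete with a short sketch. The function names `pla_lowest_index` and `leave_one_out_errors`, the zero initial weights, the treatment of a zero signal as a misclassification, and the `max_updates` cap are all assumptions made for illustration.

```python
import numpy as np

def pla_lowest_index(X, y, max_updates=100_000):
    """PLA that always updates on the misclassified point with the lowest index.

    Returns the weights (bias first) and the set of indices used in updates.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias coordinate
    w = np.zeros(Xb.shape[1])
    visited = set()
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(Xb @ w) != y)
        if wrong.size == 0:
            return w, visited
        n = int(wrong[0])                            # lowest misclassified index
        w = w + y[n] * Xb[n]
        visited.add(n)
    raise RuntimeError("did not converge; is the data separable?")

def leave_one_out_errors(X, y):
    """e_n for each n: train without point n (order preserved), test on point n."""
    errors = np.zeros(len(y), dtype=int)
    for n in range(len(y)):
        keep = np.arange(len(y)) != n
        w, _ = pla_lowest_index(X[keep], y[keep])
        pred = np.sign(w @ np.concatenate(([1.0], X[n])))
        errors[n] = int(pred != y[n])
    return errors
```

On separable data, comparing `leave_one_out_errors(X, y)` against the `visited` set returned by `pla_lowest_index(X, y)` gives a numerical check of parts (c) and (d): only visited points can contribute a nonzero $e_n$.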


Problem 8.12 Show that the optimal solution for the soft-margin optimal 
hyperplane (solving optimization problem (8.30)) with $C \to \infty$ will be the same 
solution that was developed using linear programming in Problem 3.6(c). 


Problem 8.13 The data for Figure 8.6(b) are given below: 

$y_n = +1$: 
(-0.494, 0.363) 
(-0.311, -0.101) 
(-0.0064, 0.374) 
(-0.0089, -0.173) 
(0.0014, 0.138) 
(-0.189, 0.718) 
(0.085, 0.32208) 
(0.171, -0.302) 
(0.142, 0.568) 

$y_n = -1$: 
(0.491, 0.920) 
(one entry garbled in the source) 
(-0.721, -0.710) 
(0.519, -0.715) 
(-0.775, 0.551) 
(-0.646, 0.773) 
(-0.803, 0.878) 
(0.944, 0.801) 
(0.724, -0.795) 
(-0.748, -0.853) 
(-0.635, -0.905) 

Use the data above with the 2nd and 3rd order polynomial transforms 
$\Phi_2, \Phi_3$ and the pseudo-inverse algorithm for linear regression from 
Chapter 3 to get weights $\mathbf{w}$ for your final hypothesis in $\mathcal{Z}$-space. The final 
hypothesis in $\mathcal{X}$-space is: 

$$g(\mathbf{x}) = \text{sign}\left(\mathbf{w}^{\mathrm{T}}\Phi(\mathbf{x}) + b\right).$$


(a) Plot the classification regions for your final hypothesis in $\mathcal{X}$-space. Your 
results should look something like: 

[Figure: example classification regions for the $\Phi_2$ and $\Phi_3$ fits.] 

(b) Which of the fits in part (a) appears to have overfitted? 

(c) Use the pseudo-inverse algorithm with regularization parameter $\lambda = 1$ 
to address the overfitting you identified in part (b). Give a plot of the 
resulting classifier. 
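
One way to set up the computation in this problem is sketched below. The helper names `poly_transform`, `fit_regularized` and `classify` are hypothetical, the bias $b$ is assumed to be folded into the weight vector as its first component (and therefore regularized along with the other weights), and the sketch expects the data as an $N \times 2$ array `X` with labels `y` in $\{+1, -1\}$.

```python
import numpy as np

def poly_transform(X, Q):
    """Map (x1, x2) to all monomials x1^i x2^j with 1 <= i + j <= Q."""
    x1, x2 = X[:, 0], X[:, 1]
    cols = [x1 ** (d - j) * x2 ** j for d in range(1, Q + 1) for j in range(d + 1)]
    return np.column_stack(cols)

def fit_regularized(Z, y, lam=0.0):
    """Weights (bias first) minimizing ||Zb w - y||^2 + lam * w^T w."""
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])    # constant feature for the bias
    A = Zb.T @ Zb + lam * np.eye(Zb.shape[1])
    # pinv reduces to the ordinary pseudo-inverse solution when lam = 0
    return np.linalg.pinv(A) @ Zb.T @ y

def classify(w, X, Q):
    """Final hypothesis in X-space: sign of the linear signal in Z-space."""
    Z = poly_transform(X, Q)
    return np.sign(w[0] + Z @ w[1:])
```

Calling `fit_regularized(poly_transform(X, 3), y, lam=1.0)` gives the regularized third-order fit for part (c); evaluating `classify` on a grid of points is one way to draw the classification regions.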


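Problem 8.14 The kernel trick can be used with any model as long 
as fitting the data and the final hypothesis only require the computation of 
dot-products in the $\mathcal{Z}$-space. Suppose you have a kernel $K$, so 

$$\Phi(\mathbf{x})^{\mathrm{T}}\Phi(\mathbf{x}') = K(\mathbf{x}, \mathbf{x}').$$

Let $\mathrm{Z}$ be the data in the $\mathcal{Z}$-space. The pseudo-inverse algorithm for regularized 
regression computes optimal weights $\mathbf{w}^*$ (in the $\mathcal{Z}$-space) that minimize 

$$E_{\text{aug}}(\mathbf{w}) = \|\mathrm{Z}\mathbf{w} - \mathbf{y}\|^2 + \lambda\mathbf{w}^{\mathrm{T}}\mathbf{w}.$$

The final hypothesis is $g(\mathbf{x}) = \text{sign}(\mathbf{w}^{\mathrm{T}}\Phi(\mathbf{x}))$. Using the representer theorem, 
the optimal solution can be written $\mathbf{w}^* = \sum_{n=1}^{N} \beta_n^* \mathbf{z}_n = \mathrm{Z}^{\mathrm{T}}\boldsymbol{\beta}^*$. 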


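(a) Show that $\boldsymbol{\beta}^*$ minimizes 

$$E(\boldsymbol{\beta}) = \|\mathrm{K}\boldsymbol{\beta} - \mathbf{y}\|^2 + \lambda\boldsymbol{\beta}^{\mathrm{T}}\mathrm{K}\boldsymbol{\beta},$$

where $\mathrm{K}$ is the $N \times N$ Kernel-Gram matrix with entries $\mathrm{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. 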


(b) Show that K is symmetric. 


(c) Show that the solution to the minimization problem in part (a) is: 

$$\boldsymbol{\beta}^* = (\mathrm{K} + \lambda\mathrm{I})^{-1}\mathbf{y}.$$

Can $\boldsymbol{\beta}^*$ be computed without ever 'visiting' the $\mathcal{Z}$-space? 


(d) Show that the final hypothesis is 

$$g(\mathbf{x}) = \text{sign}\left(\sum_{n=1}^{N} \beta_n^* K(\mathbf{x}_n, \mathbf{x})\right).$$
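
Parts (c) and (d) can be checked numerically with a small sketch that touches the data only through kernel evaluations. The function names `gram`, `fit_beta` and `predict`, the second-order polynomial kernel in the demo, and the random demo data are illustrative assumptions.

```python
import numpy as np

def gram(K, A, B):
    """Matrix of kernel values K(a_i, b_j) for rows a_i of A and b_j of B."""
    return np.array([[K(a, b) for b in B] for a in A])

def fit_beta(K, X, y, lam):
    """Solve (K + lam I) beta = y using only kernel evaluations."""
    G = gram(K, X, X)
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def predict(K, X, beta, X_new):
    """g(x) = sign(sum_n beta_n K(x_n, x)) for each row x of X_new."""
    return np.sign(gram(K, X_new, X) @ beta)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 2))                     # hypothetical inputs
    y = np.sign(rng.normal(size=8))                 # hypothetical labels
    poly2 = lambda a, b: (1.0 + float(a @ b)) ** 2  # 2nd-order polynomial kernel
    beta = fit_beta(poly2, X, y, lam=0.5)
    print(predict(poly2, X, beta, X))
```

With the linear kernel $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^{\mathrm{T}}\mathbf{x}'$ the $\mathcal{Z}$-space is the input space itself, so the weights recovered as $\mathrm{X}^{\mathrm{T}}\boldsymbol{\beta}^*$ can be compared against the usual regularized pseudo-inverse solution as a sanity check.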

Problem 8.15 Structural Risk Minimization (SRM). SRM is 
a useful framework for model selection. A structure is a nested sequence of 
hypothesis sets: 

$$\mathcal{H}_0 \subset \mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \cdots$$

Suppose we use in-sample error minimization as our learning algorithm in each 
$\mathcal{H}_m$, so $g_m = \arg\min_{h \in \mathcal{H}_m} E_{\text{in}}(h)$, and select $g^* = \arg\min_{g_m} \left( E_{\text{in}}(g_m) + \Omega(\mathcal{H}_m) \right)$. 

(a) Show that the in-sample error $E_{\text{in}}(g_m)$ is non-increasing in $m$. What 
about the penalty $\Omega(\mathcal{H}_m)$? How do you expect the VC-bound to behave 
with $m$? 

(b) Assume that $g^* \in \mathcal{H}_m$ with a priori probability $p_m$. (In general, the $p_m$ 
are not known.) Since $\mathcal{H}_m \subset \mathcal{H}_{m+1}$, $p_0 \le p_1 \le p_2 \le \cdots \le 1$. What 
components of the learning problem do the $p_m$'s depend on? 





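(c) Suppose $g^* = g_m \in \mathcal{H}_m$. Show that 

$$P\left[\, |E_{\text{in}}(g_m) - E_{\text{out}}(g_m)| > \epsilon \;\middle|\; g^* = g_m \right] \le \frac{1}{p_m} \cdot 4\, m_{\mathcal{H}_m}(2N)\, e^{-\epsilon^2 N/8}.$$

Here, the conditioning is on selecting the function $g_m \in \mathcal{H}_m$. [Hint: 
Bound the probability $P\left[\max_{g \in \mathcal{H}_m} |E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon \mid g^* = g_m\right]$. 
Use Bayes theorem to rewrite this as $\frac{1}{p_m} P\left[\max_{g \in \mathcal{H}_m} |E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon \text{ and } g^* = g_m\right]$. 
Use the fact that $P[A \text{ and } B] \le P[A]$, and argue that 
you can apply the familiar VC-inequality to the resulting expression.] 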


You may interpret this result as follows: if you use SRM and end up with $g_m$, 
then the generalization bound is a factor $1/p_m$ worse than the bound you would 
have gotten had you simply started with $\mathcal{H}_m$; that is the price you pay for 
allowing yourself the possibility to search more than $\mathcal{H}_m$. Typically simpler 
models occur earlier in the structure, and so the bound will be reasonable if the 
target function is simple (in which case $p_m$ is large for small $m$). SRM works 
well in practice. 
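
The selection rule $g^* = \arg\min_{g_m} \left( E_{\text{in}}(g_m) + \Omega(\mathcal{H}_m) \right)$ can be sketched in a few lines. The penalty used here is an assumption: it takes $\Omega$ to be the VC generalization bound $\sqrt{\frac{8}{N}\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$ with the growth function bounded by $(2N)^{d_{\text{vc}}} + 1$. The function names and the in-sample errors in the demo are hypothetical.

```python
import math

def vc_penalty(d_vc, N, delta=0.05):
    """sqrt((8/N) ln(4 m_H(2N) / delta)), using m_H(2N) <= (2N)^d_vc + 1."""
    growth = (2 * N) ** d_vc + 1
    return math.sqrt(8.0 / N * math.log(4.0 * growth / delta))

def srm_select(e_in, d_vcs, N, delta=0.05):
    """Return the index m minimizing E_in(g_m) + Omega(H_m), plus all scores."""
    scores = [e + vc_penalty(d, N, delta) for e, d in zip(e_in, d_vcs)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

if __name__ == "__main__":
    # Hypothetical structure: E_in shrinks with m while the VC dimension grows.
    e_in = [0.30, 0.12, 0.10, 0.09]
    d_vcs = [1, 2, 3, 4]
    best, scores = srm_select(e_in, d_vcs, N=1000)
    print(best, [round(s, 3) for s in scores])
```

The sketch makes the trade-off in part (a) visible: the in-sample term falls with $m$ while the penalty rises, so the selected index sits where the sum is smallest.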


Problem 8.16 Which can be posed within the SRM framework: selection 
among different soft order constraints $\{\mathcal{H}_C\}_{C>0}$ or selecting among different 
regularization parameters $\{\mathcal{H}_\lambda\}_{\lambda>0}$, where the hypothesis set is fixed at $\mathcal{H}$ and the 
learning algorithm is augmented error minimization with different regularization 
parameters $\lambda$? 

Problem 8.17 Suppose we use "SRM" to select among an arbitrary set of 
models $\mathcal{H}_1, \ldots, \mathcal{H}_M$ with $d_{\text{vc}}(\mathcal{H}_{m+1}) > d_{\text{vc}}(\mathcal{H}_m)$ (as opposed to a structure 
in which the additional condition $\mathcal{H}_m \subset \mathcal{H}_{m+1}$ holds). 

(a) Is it possible for $E_{\text{in}}(\mathcal{H}_m) < E_{\text{in}}(\mathcal{H}_{m+1})$? 

(b) Let $p_m$ be the probability that the process leads to a function $g_m \in \mathcal{H}_m$, 
with $\sum_m p_m = 1$. Give a bound for the generalization error in terms of 
$d_{\text{vc}}(\mathcal{H}_m)$. 


Problem 8.18 Suppose that we can order the hypotheses in a model, 
$\mathcal{H} = \{h_1, h_2, \ldots\}$. Assume that $d_{\text{vc}}(\mathcal{H})$ is infinite. Define the hypothesis 
subsets $\mathcal{H}_m = \{h_1, h_2, \ldots, h_m\}$. Suppose you implement a learning algorithm 
for $\mathcal{H}$ as follows: start with $h_1$; if $E_{\text{in}}(h_1) \le \nu$, stop and output $h_1$; if not try 
$h_2$; and so on ... 


(a) Suppose that you output $h_m$, so you have effectively only searched the $m$ 
hypotheses in $\mathcal{H}_m$. Can you use the VC-bound: (with high probability) 

$$E_{\text{out}}(h_m) \le \nu + \sqrt{\frac{\ln(2m/\delta)}{2N}}\ ?$$

If yes, why? If no, why not? 


(b) Formulate this process within the SRM framework. [Hint: the $\mathcal{H}_m$'s form 
a structure.] 

(c) Can you make any generalization conclusion (remember, $d_{\text{vc}}(\mathcal{H}) = \infty$)? 
If yes, what is the bound on the generalization error, and when do you 
expect good generalization? If no, why? 